Preface


Read This First 


This manual is a reference for programming TMS320C6x digital signal
processor (DSP) devices.


Before you use this book, you should install your code generation and debug- 
ging tools. 


This book is organized in four major parts: 


Part I: Introduction includes a brief description of the ’C6x architecture
and code development flow. It also includes a tutorial that introduces you
to the tools you will use in each phase of development and an optimization
checklist to help you achieve optimal performance from your code.

Part II: C Code includes C code examples and discusses optimization
methods for the code. This information can help you choose the most
appropriate optimization techniques for your code.

Part III: Assembly Code describes the structure of assembly code. It
provides examples and discusses optimizations for assembly code. It also
includes a chapter on interrupt subroutines.

Part IV: Appendix provides extensive code examples from the GSM EFR
vocoder.
The following books describe the TMS320C6x devices and related support
tools. To obtain a copy of any of these TI documents, call the Texas
Instruments Literature Response Center at (800) 477-8924. When ordering,
please identify the book by its title and literature number.


TMS320C6000 Assembly Language Tools User’s Guide (literature number 
SPRU186) describes the assembly language tools (assembler, linker, 
and other tools used to develop assembly language code), assembler 
directives, macros, common object file format, and symbolic debugging 
directives for the ’C6000 generation of devices. 


TMS320C6000 Optimizing C Compiler User’s Guide (literature number 
SPRU187) describes the C6000 C compiler and the assembly optimizer. 
This C compiler accepts ANSI standard C source code and produces as- 
sembly language source code for the C6000 generation of devices. The 
assembly optimizer helps you optimize your assembly code. 


TMS320C6x C Source Debugger User’s Guide (literature number 
SPRU188) tells you how to invoke the ’C6x simulator and emulator 
versions of the C source debugger interface. This book discusses 
various aspects of the debugger, including command entry, code 
execution, data management, breakpoints, profiling, and analysis. 


TMS320C6000 CPU and Instruction Set Reference Guide (literature 
number SPRU189) describes the C6000 CPU architecture, instruction 
set, pipeline, and interrupts for these digital signal processors. 


TMS320 DSP Designer’s Notebook: Volume 1 (literature number 
SPRT125) presents solutions to common design problems using ’C2x, 
’C8x, ’C4x, 'C5x, and other TI DSPs. 


TMS320C6000 Peripherals Reference Guide (literature number SPRU190) 
describes common peripherals available on the TMS320C6000 digital 
signal processors. This book includes information on the internal data 
and program memories, the external memory interface (EMIF), the host 
port interface (HPI), multichannel buffered serial ports (McBSPs), direct 
memory access (DMA), enhanced DMA (EDMA), expansion bus, clock- 
ing and phase-locked loop (PLL), and the power-down modes. 


TMS320C6201 Digital Signal Processor Data Sheet (literature number 
SPRS051) describes the features of the TMS320C6201 and provides 
pinouts, electrical specifications, and timings for the device. 
Trademarks
Solaris and SunOS are trademarks of Sun Microsystems, Inc.
VelociTI is a trademark of Texas Instruments Incorporated.


Windows and Windows NT are registered trademarks of Microsoft 
Corporation. 
If You Need Assistance


World-Wide Web Sites

TI Online                                        http://www.ti.com
Semiconductor Product Information Center (PIC)   http://www.ti.com/sc/docs/pic/home.htm
DSP Solutions                                    http://www.ti.com/dsps
320 Hotline On-line™                             http://www.ti.com/sc/docs/dsps/support.htm

North America, South America, Central America

Product Information Center (PIC)          (972) 644-5580
TI Literature Response Center U.S.A.      (800) 477-8924
Software Registration/Upgrades            (214) 638-0333   Fax: (214) 638-7742
U.S.A. Factory Repair/Hardware Upgrades   (281) 274-2285
U.S. Technical Training Organization      (972) 644-5580
DSP Hotline                               Email: dsph@ti.com
DSP Internet BBS via anonymous ftp to ftp://ftp.ti.com/pub/tms320bbs

Europe, Middle East, Africa

European Product Information Center (EPIC) Hotlines:
    Multi-Language Support   +33 1 30 70 11 69   Fax: +33 1 30 70 10 32
                             Email: epic@ti.com
    Deutsch                  +49 8161 80 33 11 or +33 1 30 70 11 68
    English                  +33 1 30 70 11 65
    Français                 +33 1 30 70 11 64
    Italiano                 +33 1 30 70 11 67
EPIC Modem BBS                       +33 1 30 70 11 99
European Factory Repair              +33 4 93 22 25 40
Europe Customer Training Helpline    Fax: +49 81 61 80 40 10

Asia-Pacific

Literature Response Center   +852 2 956 7288   Fax: +852 2 956 2200
Hong Kong DSP Hotline        +852 2 956 7268   Fax: +852 2 956 1002
Korea DSP Hotline            +82 2 551 2804    Fax: +82 2 551 2828
Korea DSP Modem BBS          +82 2 551 2914
Singapore DSP Hotline                          Fax: +65 390 7179
Taiwan DSP Hotline           +886 2 377 1450   Fax: +886 2 377 2718
Taiwan DSP Modem BBS         +886 2 376 2592
Taiwan DSP Internet BBS via anonymous ftp to ftp://dsp.ee.tit.edu.tw/pub/TI/

Japan

Product Information Center   +0120-81-0026 (in Japan)   Fax: +0120-81-0036 (in Japan)
Chapter 1


Introduction 


Part I 


This chapter introduces some features of the ’C6x microprocessor and
discusses the basic process for creating code. Any reference to ’C6x pertains
to both the ’C62x (fixed-point) and the ’C67x (floating-point) devices. All
techniques are applicable to both devices, even though most of the examples
shown are fixed-point specific.


1.1 TMS320C6x Architecture

The ’C62x is a fixed-point digital signal processor (DSP) and is the first DSP
to use the VelociTI™ architecture. VelociTI is a high-performance, advanced
very-long-instruction-word (VLIW) architecture, making it an excellent choice
for multichannel, multifunction, and performance-driven applications.

The ’C67x is a floating-point DSP with the same features. It is the second DSP
to use the VelociTI™ architecture.

The ’C6x DSPs are based on the ’C6x CPU, which consists of:

    Program fetch unit
    Instruction dispatch unit
    Instruction decode unit
    Two data paths, each with four functional units
    Thirty-two 32-bit registers
    Control registers
    Control logic
    Test, emulation, and interrupt logic


1.2 TMS320C6x Pipeline 


The ’C6x pipeline has several features that provide optimum performance, low
cost, and simple programming.

    Increased pipelining eliminates traditional architectural bottlenecks in
    program fetch, data access, and multiply operations.

    Pipeline control is simplified by eliminating pipeline locks.

    The pipeline can dispatch eight parallel instructions every cycle.

    Parallel instructions proceed simultaneously through the same pipeline
    phases.




1.3 Code Development Flow to Increase Performance 


You can achieve the best performance from your ’C6x code if you follow this 
flow when you are writing and debugging your code: 


[Figure: Code development flow. Phase 1: Develop C Code (write C code).
Phase 2: Refine C Code. Phase 3: Write Linear Assembly (write linear
assembly, assembly optimize). Yes/No decision points lead to Complete.]


The following table lists the phases in the 3-step software development flow
shown on page 1-3, and the goal for each phase:

Phase   Goal

1       You can develop your C code for phase 1 without any knowledge of
        the ’C6x. Use the ’C6x profiling tools that are described in the
        TMS320C6x C Source Debugger User's Guide to identify any
        inefficient areas that you might have in your C code. To improve the
        performance of your code, proceed to phase 2.

2       Use the intrinsics, shell options, and techniques that are described
        in Chapter 4 of this book to improve your C code. Use the ’C6x
        profiling tools to check its performance. If your code is still not as
        efficient as you would like it to be, proceed to phase 3.

3       Extract the time-critical areas from your C code and rewrite the code
        in linear assembly. You can use the assembly optimizer to optimize
        this code.
Code Development Flow Tutorial


Part I 


This chapter walks you through the code development flow that was 
introduced in Chapter 1. It uses step-by-step instructions and code examples 
to show you how to use the software development tools in each phase of devel- 
opment. 


Before you start this tutorial, you should install the code generation tools and 
the C source debugger. If you do not have a Texas Instruments C source de- 
bugger, use your own debugger to check your results. 


The sample code that is used in this tutorial is included on the code generation 
tools CD-ROM. When you install your code generation tools, the example 
code is installed in the c6xtools directory. Use the code in that directory to go 
through the examples in this chapter. 


The examples in this chapter were run on the most recent version of the soft- 
ware development tools that were available as of the publication of this book. 
Because the tools are being continuously improved, you may get different re- 
sults if you are using a more recent version of the tools. 


2.1 Before You Begin 
This tutorial contains three basic types of information: 


Primary tasks           Primary tasks identify the main lessons in the
                        tutorial; they are boxed so that you can find
                        them easily. A primary task looks like this:

                            On a command line, enter:

                            load6x count.out

Important information   In addition to primary tasks, important
                        information ensures that the tutorial works
                        correctly. Important information is marked like this:

                            Important! If you are using SunOS, be sure
                            you reinitialize your shell before continuing
                            with this tutorial.

Optional tasks          Optional tasks allow you to learn more about
                        the ’C6x tools; however, you do not need to
                        perform the optional tasks to complete the tutorial
                        successfully. Optional tasks are marked like this:

                            Try This: The stand-alone simulator (load6x)
                            is another tool that you can use to find out
                            the cycle count for each function.


This tutorial is divided into lessons. Each lesson builds on the previous lesson. 
To get the most benefit from the tutorial, you should start at the beginning and 
work your way through each lesson in order to the end. 
2.2 Introduction to the Example Code

The C code example that you will use to start this tutorial is demo1.c, which
is shown in Example 2-1. This example calls three functions: mac1( ),
vec_mpy1( ), and iir1( ).


Example 2-1. The Code Example—demo1.c 


main(int argc, char *argv[])
{
    const short coefs[150];
    short optr[150];
    short state[2];
    const short a[150];
    const short b[150];
    int c = 0;
    int dotp[1] = {0};
    int sum = 0;
    short y[150];
    short scalar = 3345;
    const short x[150];

    sum = mac1(a, b, c, dotp);
    vec_mpy1(y, x, scalar);
    iir1(coefs, x, optr, state);
}


The mac1( ) function, a multiply accumulate and squaring accumulate
example, is shown in Example 2-2. It performs a dot product of vector a with
vector b and also squares and sums vector b.

Example 2-2. The Multiply Accumulate Function—mac1.c

int mac1(const short *a, const short *b, int sqr, int *sum)
{
    int i;
    int dotp = *sum;

    for (i = 0; i < 150; i++)
    {
        dotp += b[i] * a[i];
        sqr += b[i] * b[i];
    }

    *sum = dotp;
    return sqr;
}
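
A quick way to check the arithmetic of mac1( ) on a host machine is to run the same loop over a few known values. The sketch below is only an illustration and is not one of the tutorial files: mac_n( ) is a hypothetical variant that takes the vector length as a parameter instead of using the fixed 150.

```c
/* Hypothetical host-side variant of mac1(): same arithmetic, but the
 * vector length n is a parameter instead of the fixed 150. */
int mac_n(const short *a, const short *b, int n, int sqr, int *sum)
{
    int i;
    int dotp = *sum;

    for (i = 0; i < n; i++)
    {
        dotp += b[i] * a[i];   /* dot product of a and b      */
        sqr  += b[i] * b[i];   /* running sum of squares of b */
    }

    *sum = dotp;
    return sqr;
}
```

With a = {1, 2, 3}, b = {4, 5, 6}, and both accumulators starting at 0, the dot product stored through sum is 32 and the returned sum of squares is 77.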


The vec_mpy1( ) function shown in Example 2-3 is a vector multiply, which is
a scalar multiply followed by a right shift. The result is stored to a second
vector.

Example 2-3. The Vector Multiply Function—vec_mpy1.c

void vec_mpy1(short y[], const short x[], short scalar)
{
    int i;

    for (i = 0; i < 150; i++)
        y[i] += ((scalar * x[i]) >> 15);
}
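
The multiply followed by a right shift of 15 looks like a Q15 fixed-point multiply, in which 16-bit values stand for fractions scaled by 32768. That reading is an assumption; the tutorial does not name the number format. Under that assumption, the per-element operation can be sketched as follows, where q15_mul( ) is a hypothetical helper name:

```c
/* Q15 fixed-point multiply (assumed format): both operands are 16-bit
 * values that represent fractions scaled by 2^15 (32768). */
short q15_mul(short a, short b)
{
    int product = (int)a * (int)b;   /* 32-bit product, scaled by 2^30 */

    /* Scale back to Q15. Right-shifting a negative int is
     * implementation-defined in C; most compilers shift arithmetically. */
    return (short)(product >> 15);
}
```

For example, 16384 represents 0.5 in Q15, and q15_mul(16384, 16384) gives 8192, which represents 0.25.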


The third function, iir1( ), is a typical infinite impulse response (IIR) biquad
filter. The code for this function is shown in Example 2-4.

Example 2-4. The Biquad Filter—iir1.c

void iir1(const short *coefs, const short *input,
          short *optr, short *state)
{
    short x;
    short t;
    int n;

    x = input[0];

    for (n = 0; n < 50; n++)
    {
        t = x + ((coefs[2] * state[0] +
                  coefs[3] * state[1]) >> 15);

        x = t + ((coefs[0] * state[0] +
                  coefs[1] * state[1]) >> 15);

        state[1] = state[0];
        state[0] = t;
        coefs += 4;   /* point to next filter coefs  */
        state += 2;   /* point to next filter states */
    }

    *optr++ = x;
}
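
To see what a single pass through the iir1( ) loop body does, it can help to isolate one filter section. The sketch below is an illustration under assumptions, not part of the tutorial files: biquad_section( ) is a hypothetical helper that applies the same two expressions to one set of four coefficients and two state values.

```c
/* One biquad section, following the loop body of iir1().
 * coefs[0..3] are the section coefficients and state[0..1] hold the
 * two previous intermediate values. Returns the section output. */
short biquad_section(short x, const short *coefs, short *state)
{
    short t;

    /* recursive part: combine the input with the stored states */
    t = (short)(x + ((coefs[2] * state[0] + coefs[3] * state[1]) >> 15));

    /* non-recursive part: form the output from the same states */
    x = (short)(t + ((coefs[0] * state[0] + coefs[1] * state[1]) >> 15));

    state[1] = state[0];   /* age the delay line */
    state[0] = t;

    return x;
}
```

With both state values at zero, the first call returns its input unchanged and stores that input in state[0]; the coefficients only take effect from the second call onward.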
2.3 Lesson 1: Compiling, Assembling, and Linking the Example Code

The first step is to compile, assemble, and link the code.

Compiling for the ’C62x:
On a command line, enter the following on a single line:

cl6x -g -o -k -mg demo1.c mac1.c vec_mpy1.c iir1.c -z lnk.cmd -l rts6201.lib -o demo1.out

Compiling for the ’C67x:
On a command line, enter the following on a single line:

cl6x -g -o -k -mg -mv6700 demo1.c mac1.c vec_mpy1.c iir1.c -z lnk.cmd -l rts6701.lib -o demo1.out


You should not receive any errors, and the file, demo1.out, should be created. 
If you receive an error message, look up that error message in the appropriate 
user’s guide. 


Here is a description of what you told the shell program (cl6x) to do: 


cl6x        Run the compiler and the assembler.

-g          Generate symbolic debugging directives that are used by
            the debugger.

-o          Invoke the optimizer at the default level (-o is the same as
            -o2).

            Not all optimizations work well with debugging because the
            optimizer’s rearrangement of code can make it difficult for
            you to correlate source code with object code. Using the -g
            option with the -o option allows for the maximum amount
            of optimization that is compatible with debugging.

-k          Keep the assembly output files. Notice that you now have
            the following .asm files in your current directory:
            demo1.asm, mac1.asm, vec_mpy1.asm, and iir1.asm.

            When the -k option is not used, the shell program deletes
            the assembly output files after assembly is complete.

-mg         Turn on the maximum amount of optimization that is
            compatible with profiling. The -mg option allows you to profile
            optimized code.

-mv6700     Invoke the compiler to target ’C67x devices.

            If this switch is not used, the compiler defaults to the ’C62x
            device. The resulting code runs on a ’C67x device, but it runs
            more slowly when floating-point operations are involved,
            because the code was compiled for the ’C62x device.
-z              Invoke the linker. The addition of this option to the cl6x
                command line means that the code is compiled, assembled,
                and linked in one step.

lnk.cmd         Use lnk.cmd as the linker command file. Linker command
                files allow you to put linking information into a file, which is
                useful when you invoke the linker often with the same
                information.

                Linker command files are also useful because they allow
                you to use the MEMORY directive, which defines the target
                memory configuration, and the SECTIONS directive, which
                controls how sections are built and allocated.

-l rts6201.lib  Include the runtime-support library for the ’C62x device,
                rts6201.lib, which is included on your CD-ROM.

                The runtime-support functions in rts6201.lib were compiled
                for little-endian mode. For big-endian mode, use the
                runtime-support functions in rts6201e.lib.

-l rts6701.lib  Include the runtime-support library for the ’C67x device,
                rts6701.lib, which is included on your CD-ROM.

                The runtime-support functions in rts6701.lib were compiled
                for little-endian mode. For big-endian mode, use the
                runtime-support functions in rts6701e.lib.

-o demo1.out    Name the output file demo1.out. (The default is a.out.)

                Because this option comes after the -z option, it is
                considered a linker option and is interpreted differently than the -o
                option that you entered before -z.


Try This: 


The options above are used throughout the rest of this tutorial. 
They are fairly common and might be ones that you want to use repeatedly. 
To avoid having to retype them each time you run the code development tools, 
you can use the C_OPTION environment variable. The shell program uses the 
default options and/or input filenames that you name with the C_OPTION envi- 
ronment variable every time you run the shell. 


Use the commands in Table 2-1 to set up the C_OPTION environment vari- 
able with the options used on page 2-5. 
Table 2-1. Using the C_OPTION Environment Variable

Your Setup             What to Change   Command

Windows NT™            System applet    SET C_OPTION=-g -o -k -mg -z lnk.cmd -l rts6201.lib
Windows™ 95            autoexec.bat     SET C_OPTION=-g -o -k -mg -z lnk.cmd -l rts6201.lib
C shell                .cshrc           setenv C_OPTION "-g -o -k -mg -z lnk.cmd -l rts6201.lib"
Bourne or Korn shell   .profile         C_OPTION="-g -o -k -mg -z lnk.cmd -l rts6201.lib"; export C_OPTION


Notice that the -o demo1.out linker option was not included. If it were included,
running the second tutorial example, demo2.c, would result in an output file
named demo1.out instead of a more logical name such as demo2.out.

Input files must be named explicitly on the command line, not in the
environment variable. To compile all of the C files in a directory, use the cl6x
command with the appropriate options and use *.c where the files are normally
indicated. For example:

cl6x -g -mg *.c -z lnk.cmd -l rts6201.lib -o demo1.out


Important! If you are using SunOS, be sure you reinitialize your shell before
continuing with this tutorial:

    For C shells, enter the following on a command line:

    source ~/.cshrc

    For Bourne or Korn shells, enter the following on a command line:

    . ~/.profile
2.4 Lesson 2: Profiling the Example Code

There are several different methods of profiling your code. If you use
Code Composer 4.02 or Code Composer Studio 1.00, follow the method
described in section 2.4.1, Using the Standalone Simulator for Profiling. If
you have an older version of the TI debugger, you may follow either the
method described in 2.4.1 or the one in 2.4.2.

2.4.1 Using the Standalone Simulator for Profiling

There are two ways to use the standalone simulator (load6x) for profiling.
If you are interested in just a profile of all of the functions in your application,
there is an option in load6x. If you are interested in profiling the cycle count
of only one or two functions, or in a region of code inside a particular
function, you can use calls to the clock() function (supported by load6x)
to time those particular functions or regions of code.


2.4.1.1 Using the -g Option to Profile on load6x 
Invoking load6x with the -g option runs the standalone simulator in profiling
mode. Source files must be compiled with the -mg profiling option for profiling
to work on the standalone simulator. The profile results resemble the results
given by the profiler in the TI simulator debugger. The profile results are stored
in a file with the same name as the .out file, but with the .vaa extension.

For example, to create a profile information file for the compiled and linked file
named "example.out", enter the following on your command line:

load6x -g example.out

Now, you can open the file "example.vaa" in a text editor to see the results of
the profile session on the .out file.

For example, suppose you built demo1.out with the following command line:

cl6x -g -o -k -mg demo1.c mac1.c vec_mpy1.c iir1.c -z lnk.cmd -l rts6201.lib -o demo1.out

Then run demo1.out on load6x with profiling enabled:

load6x -g demo1.out
A new file, demo1.vaa, should have been created in the same directory as the
demo1.out file. Open the demo1.vaa file with a text editor. You should see the
following in the file:

Program Name:    demo1.out
Start Address:   00007980 main, at line 1, "demo1.c"
Stop Address:    00007860 exit
Run Cycles:      3359
Profile Cycles:  3339
BP Hits:         11

*******************************************************************

Area Name        Count   Inclusive   Incl-Max   Exclusive   Excl-Max
CF iir1()            1         236        236         236        236
CF vec_mpy1()        1         248        248         248        248
CF mac1()            1         168        168         168        168
CF main()            1        3333       3333          40         40


Count represents the number of times each function was called and entered.
Inclusive represents the total cycle time spent inside that function including
calls to other functions. Incl-Max (Inclusive Max) represents the longest time
spent inside that function during one call. Exclusive and Excl-Max are the
same as Inclusive and Incl-Max except that time spent in calls to other
functions inside that function has been removed.
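
The relationship between the Inclusive and Exclusive columns can be sketched as a small calculation. This is an illustration only: exclusive_cycles( ) is a hypothetical helper, not a tool or library function, and the figures in the usage note are made up.

```c
/* Exclusive cycles = inclusive cycles of a function minus the inclusive
 * cycles of every call it makes to other profiled functions. */
long exclusive_cycles(long inclusive, const long *callee_inclusive,
                      int n_callees)
{
    long exclusive = inclusive;
    int i;

    for (i = 0; i < n_callees; i++)
        exclusive -= callee_inclusive[i];   /* remove time spent in callee */

    return exclusive;
}
```

A function with 1000 inclusive cycles that calls two functions costing 300 and 200 cycles has 500 exclusive cycles. A function that makes no calls has identical inclusive and exclusive counts.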


2.4.1.2 Using the clock() Function to Profile 


To get cycle count information for a function or region of code with the
standalone simulator, embed the clock() function in your C code. Example 2-5
shows how to rewrite demo1.c to include the clock() function.
Example 2-5. Including the clock() Function in demo1.c (count.c)

#include <stdio.h>
#include <time.h>

main(int argc, char *argv[])
{
    const short coefs[150];
    short optr[150];
    short state[2];
    const short a[150];
    const short b[150];
    int c = 0;
    int dotp[1] = {0};
    int sum = 0;
    short y[150];
    short scalar = 3345;
    const short x[150];
    clock_t start, stop, overhead;

    /* measure the overhead of calling clock() */
    start = clock();
    stop = clock();
    overhead = stop - start;

    start = clock();
    sum = mac1(a, b, c, dotp);
    stop = clock();
    printf("mac1 cycles: %d\n", stop - start - overhead);

    start = clock();
    vec_mpy1(y, x, scalar);
    stop = clock();
    printf("vec_mpy1 cycles: %d\n", stop - start - overhead);

    start = clock();
    iir1(coefs, x, optr, state);
    stop = clock();
    printf("iir1 cycles: %d\n", stop - start - overhead);
}


Note:

When using this method, remember to calculate the overhead and include
the appropriate header files.
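
The overhead calculation can be factored into small helpers so that each timed region is handled the same way. The sketch below is a portable illustration, not code from the tutorial files: clock_overhead( ) and net_ticks( ) are hypothetical names, and the cast to long is an assumption made so that the result can be printed with %ld regardless of how a toolchain defines clock_t.

```c
#include <time.h>

/* Measure the cost of two back-to-back clock() calls so that it can be
 * subtracted from later measurements. */
clock_t clock_overhead(void)
{
    clock_t start = clock();
    clock_t stop  = clock();

    return stop - start;
}

/* Net ticks for a timed region: elapsed ticks minus the measured
 * overhead. The result is cast to long so that it can be printed
 * with %ld. */
long net_ticks(clock_t start, clock_t stop, clock_t overhead)
{
    return (long)(stop - start) - (long)overhead;
}
```

A timed region then becomes: read clock() into start, call the function, read clock() into stop, and print net_ticks(start, stop, overhead).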
Now, compile, assemble, and link count.c.

If you did not set up your C_OPTION environment variable as described
on page 2-6, enter the following on a command line:

cl6x -g -o -k -mg count.c mac1.c vec_mpy1.c iir1.c -z lnk.cmd -l rts6201.lib -o count.out

OR

If you set up your C_OPTION environment variable as described on
page 2-6, enter the following on a command line:

cl6x -z -o count.out

Although the -z option is already specified in the C_OPTION environment
variable, you need to specify it on the command line to indicate that this
occurrence of -o is a linker option.


Use load6x to see the output of the printf statements that were embedded in 
the C code. 


On a command line, enter: 


load6x count.out 


You should see the following output: 


TMS320C6x C I/O COFF Loader Version 1.01
Copyright (c) 1989-1997 Texas Instruments Incorporated
Interrupt to abort
mac1 cycles: 175
vec_mpy1 cycles: 324
iir1 cycles: 278
NORMAL COMPLETION: 20949 cycles

Notice that these cycle counts are higher than the cycle counts that you saw
with the profiler. For example, mac1 is listed here as having 175 cycles;
however, it was listed in the Profiler window as having 167 cycles. You will see
some extra cycles when you use load6x because you still have overhead for each
function call. When you use the profiler, the cycles needed for calling the
functions are not included in the profile display.


The Using the Stand-Alone Simulator chapter in the TMS320C6x Optimizing 
C Compiler User's Guide discusses load6x in more detail. 


2.4.2 Using the Profiler in the Debugger 


Now, use the profiler to look at the output of demo1. In this lesson, you will use 
the profiler to see the total execution time in number of cycles of each C func- 
tion in demo1.out. 
To start the profiler and load demo1.out, follow these steps:

1) Double-click the icon for the debugger.

2) From the Profile menu, select Profile Mode.

   The debugger switches to profiling mode and displays only the
   Command, Disassembly, File, and Profile windows.

3) From the File menu, select Load Program.

   This displays the Load Program File dialog box.

4) Double-click the demo1.out file. To do so, you might need to change the
   working directory.

   This loads demo1.out into the profiler. Because the File window is
   reserved for C programs, it disappears.

To select the areas of demo1 that you want profiled, follow these steps:

1) From the Profile menu, select Select Areas.

   This displays the Profile Marking dialog box.

2) In the Level box, select C.

3) In the Area box, select Functions.

   This indicates that the C functions in demo1.out will be your profile
   areas.

4) Click Mark.


[Figure: Profile Marking dialog box, with Level options (C, Assembly, Both)
and Area options (Lines, Ranges, Functions, All areas)]


5) Click Close. 


The Profile window is updated to include a line for each C function in
demo1.
To start the profiling session, follow these steps:

1) Click the run icon on the toolbar.

   This displays the Profile Run dialog box.

2) In the Run Method box, select Quick, no exclusive fields. This will show
   you the total execution time (cycle count) of a profile area, including the
   execution time of any subroutines called within the functions.

3) If main() is not already selected as your starting point, choose it from
   the list of starting points.


[Figure: Profile Run dialog box, with Run Method options (Full, all fields;
Quick, no exclusive fields; Resume; Clear data), a Display Rate slider from
Often to Never, and Start Point set to main]


4) Click OK. 


The Run Method dialog box closes and the status bar reads Target: 
Profiling to indicate that the profiling session has started. 
The program restarts and runs to main( ) without profiling. Profiling begins
when main( ) is reached and continues until the exit point of main( ) is reached.
When profiling is complete, the status bar reads Target: Halted and your Profile
window looks like this:

[Figure: Profile window listing the C Function areas iir1(), mac1(), main(),
and vec_mpy1() under the columns Area Name, Count, Inclusive, and Incl-Max]

The Inclusive column indicates the cycle counts for each function, including 
any function that it calls. Because these functions do not call any other func- 
tions, the inclusive cycle counts are the same as the exclusive cycle counts. 
Notice that the cycle count for the mac1() function is 167, and that the cycle 
counts for the vec_mpy1() and iir1( ) functions are much higher—316 and 
270, respectively. 


To interpret the cycle counts in the Profile window, you need to understand how 
they are calculated. Here is the formula for calculating cycle counts: 


Execute packets x loop iterations in C code + constant 


An execute packet is a group of parallel instructions. You can have up to eight
instructions executing in parallel; therefore, each execute packet can contain
up to eight instructions. An example of execute packets is shown in
Example 2-7 on page 2-16.

Table 2-2 shows how the cycle counts were calculated for each function.


Table 2-2. Cycle Counts

Function      Execute Packets   Loop Iterations   Constant   Cycle Count

mac1( )       1                 150               17         1 x 150 + 17 = 167
vec_mpy1( )   2                 150               16         2 x 150 + 16 = 316
iir1( )       5                 50                20         5 x 50 + 20 = 270
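
The cycle-count formula given above can be written as a one-line function to check the table entries. This is a minimal sketch; estimate_cycles( ) is a hypothetical name used only for illustration.

```c
/* Cycle-count estimate from the formula in the text:
 * execute packets x loop iterations in C code + constant. */
int estimate_cycles(int execute_packets, int loop_iterations, int constant)
{
    return execute_packets * loop_iterations + constant;
}
```

For vec_mpy1( ), estimate_cycles(2, 150, 16) gives 316, and for iir1( ), estimate_cycles(5, 50, 20) gives 270, matching the cycle counts reported in the Profile window.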
2.5 Lesson 3: Phase 1 of the Code Development Flow


Looking at the functions in demo1 one at a time, you can determine whether 
or not they need to be improved and, if they do need to be improved, how they 
can be improved. Start by looking at the first function, mac1( ). 


Example 2-6 shows the assembly output of the function’s inner loop kernel. 
The loop kernel is the area of the loop with the most parallelism. Only the inner 
loop is shown, because this is the area that can be improved with software pi- 
pelining. Notice that there are eight instructions executing in parallel (as indi- 
cated by the seven sets of parallel bars). This is the maximum number of 
instructions that the ’C6x can execute in parallel, so this code does not need 
to be improved. 


Example 2-6. Inner Loop Kernel of mac1.asm 


L3:        ; PIPED LOOP KERNEL

           ADD   .L2   B4,B7,B7      ;
||         ADD   .L1   A5,A3,A3      ;
||         MPY   .M2X  A4,B5,B4      ;@@
||         MPY   .M1   A4,A4,A5      ;@@
|| [ B0]   B     .S1   L3            ;@@@@@
|| [ B0]   SUB   .S2   B0,1,B0       ;@@@@@@
||         LDH   .D1   *A0++,A4      ;@@@@@@@
||         LDH   .D2   *B6++,B5      ;@@@@@@@


The @ characters specify the iteration of the loop that an instruction is on in
the software pipeline; these symbols are automatically created by the code
generation tools. The first iteration does not have an @ character; one @
character represents the second iteration; two @ characters represent the third
iteration, and so on.


Because the mac1( ) function does not need to be improved, it does not need 
to go beyond phase 1 of the code development flow. 
Look at Example 2-7, which shows the assembly output of the innermost loop
for the vec_mpy1( ) function. Recall from page 2-14 that the vec_mpy1( )
function took 316 cycles to execute. This code is not as parallel as the mac1( )
function. The assembly output for the vec_mpy1( ) function shows two execute
packets. Each execute packet has four parallel instructions. This loop can be
improved.


Example 2-7. Inner Loop Kernel of vec_mpy1.asm 


L3:        ; PIPED LOOP KERNEL

           ADD   .L2X  A3,B6,B5      ;        first execute packet
|| [ A1]   B     .S1   L3            ;@@
||         LDH   .D2   *+B4(6),B6    ;@@@
||         LDH   .D1   *A0++,A4      ;@@@@

           STH   .D2   B5,*B4++      ;        second execute packet
||         SHR   .S1   A3,15,A3      ;@
||         MPY   .M1   A5,A4,A3      ;@@
|| [ A1]   SUB   .L1   A1,1,A1       ;@@@
Example 2-8 shows the assembly output of the innermost loop for the iir1( )
function. Recall from page 2-14 that the iir1( ) function took 270 cycles to
execute. As you can see, some execute packets have five parallel instructions,
while others have as few as four parallel instructions, which indicates that the
code can probably be improved.


Example 2-8. Inner Loop Kernel of iir1.asm 


L3:        ; PIPED LOOP KERNEL

           SHR   .S2   B4,15,B4        ;
           SHR   .S1   A3,15,A5        ;
           MPY   .M2X  B6,A5,B6        ;@
           LDH   .D1   *+A6(16),A4     ;@@
           LDH   .D2   *+B7(10),B6     ;@@
           ADD   .L1   A0,A5,A0        ;
           MPY   .M1X  B6,A3,A3        ;@
           MPY   .M2X  B5,A4,B5        ;@
           LDH   .D1   *+A6(22),A3     ;@@
           LDH   .D2   *+B7(8),B5      ;@@
           EXT   .S1   A0,16,16,A0     ;
           STH   .D2   B5,*+B7(6)      ;@
           MPY   .M1X  B5,A3,A4        ;@
           LDH   .D1   *+A6(20),A3     ;@@
           ADD   .S1   8,A6,A6         ;
           STH   .D2   A0,*B7++(4)     ;
           ADD   .L1X  A0,B4,A0        ;
   [ B0]   SUB   .L2   B0,1,B0         ;@
           ADD   .S2   B6,B5,B4        ;@
           EXT   .S1   A0,16,16,A0     ;
   [ B0]   B     .S2   L3              ;@
           ADD   .L1   A3,A4,A3        ;@
           LDH   .D1   *+A6(18),A5     ;@@@


To improve the vec_mpy( ) and iir( ) functions, start by seeing how you can re-
fine and improve your C code. This is what is referred to as phase 2 of the code
development flow, and this is what the next lesson is about.
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2.6 Lesson 4: Phase 2 of the Code Development Flow 


For your convenience, the vec_mpy1() function is duplicated here as 
Example 2-9 (the C version) and Example 2-10 (the assembly output of the 
inner loop). This is the same code that you saw in Example 2-3 and 
Example 2-7. 


Example 2-9. The Vector Multiply Function—vec_mpy1.c 


void vec_mpy1(short y[], const short x[], short scalar)
{
    int i;

    for (i = 0; i < 150; i++)
        y[i] += ((scalar * x[i]) >> 15);
}


Example 2-9 uses short data types. Because short data types are 16 bits, they 
translate into halfword instructions, such as LDH and STH (see 
Example 2-10). 


The loop in Example 2-10 uses two LDH instructions and an STH instruction
to load x[i] and y[i] and store back to y[i]. Because only two memory operations
can occur per cycle, the fastest that this loop can execute is one y[i] result ev-
ery two cycles. The performance of this loop is limited by the number of D units.
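
The bound described above is simple arithmetic: three memory operations per iteration shared between two D units cannot fit in fewer than two cycles. The following is a minimal sketch of that calculation; the function name resource_bound is hypothetical and is not part of the ’C6x tools.

```c
/* Lower bound, in cycles per loop iteration, imposed by one kind of
   functional unit: the number of operations that need the unit,
   divided by the number of such units, rounded up.
   resource_bound is a hypothetical helper used only for illustration. */
int resource_bound(int ops_per_iteration, int units)
{
    return (ops_per_iteration + units - 1) / units;
}
```

For the halfword loop, resource_bound(3, 2) gives 2: two LDH instructions and one STH instruction on two D units. If a loop needed four memory operations per iteration, the bound would still be 2 cycles, which is why producing two results per iteration with word accesses is attractive.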


Example 2-10. Inner Loop Kernel of vec_mpy1.asm

L3:        ; PIPED LOOP KERNEL

           ADD     .L2X    A3,B6,B5        ;
|| [ A1]   B       .S1     L3              ;@@
||         LDH     .D2     *+B4(6),B6      ;@@@
||         LDH     .D1     *A0++,A4        ;@@@@

           STH     .D2     B5,*B4++        ;
||         SHR     .S1     A3,15,A3        ;@
||         MPY     .M1     A5,A4,A3        ;@@
|| [ A1]   SUB     .L1     A1,1,A1         ;@@@


Because x is an array, x[i] and x[i + 1] are next to each other in memory. This
means that instead of using halfword instructions (LDH and STH) to load and
store each element in the array, you can use word instructions (LDW and STW)
to load and store two elements at a time, as long as the data is aligned on a
word boundary. In other words, all word accesses should have the 2 LSBs of
the address set to 0. Two elements at a time, x[i] and x[i + 1], fit into one 32-bit
register.
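
The two conditions in the paragraph above (word alignment and two halfwords sharing one register) can be sketched in portable C. The helper names below are hypothetical, and the packing order assumes little-endian byte order, so that x[i] lands in the 16 LSBs and x[i + 1] in the 16 MSBs.

```c
#include <stdint.h>

/* Returns 1 if p is suitable for a 32-bit (word) access, that is,
   if the 2 LSBs of the address are 0. Hypothetical helper. */
int is_word_aligned(const void *p)
{
    return ((uintptr_t)p & 0x3u) == 0;
}

/* Packs two 16-bit elements into one 32-bit value the way a word load
   of x[i] and x[i + 1] would on a little-endian device (an assumption):
   lo goes to the 16 LSBs, hi goes to the 16 MSBs. */
uint32_t pack_halfwords(int16_t lo, int16_t hi)
{
    return ((uint32_t)(uint16_t)hi << 16) | (uint16_t)lo;
}
```

An address such as 0x1000 passes the alignment check, while 0x1002 does not, even though it is a valid halfword address.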
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To achieve this in C, declare x[ ] as an integer instead of as a short data type. 
Also, you need to use some intrinsics. 


Now that you have determined that you can load x[i] and x[i + 1] into the same 
register, you need to figure out how to do it. You can do this by using the _mpy 
and _mpylh intrinsics. Intrinsics are like built-in C functions that correspond to 
’C6x assembly language instructions. The _mpy intrinsic multiplies the 
16 LSBs of one operand by the 16 LSBs of another and returns the result. The 
_mpylh intrinsic multiplies the 16 LSBs of the first operand by the 16 MSBs of 
the second and returns the result. 


You can then use the _add2 intrinsic to add the 16 MSBs of the first operand 
to the 16 MSBs of the second operand. At the same time, the _add2 intrinsic 
also adds the 16 LSBs of the first operand to the 16 LSBs of the second oper- 
and. The result of both additions is stored in a 32-bit operand. 


     MSBs     LSBs        (first operand)
      +        +
     MSBs     LSBs        (second operand)
      =        =
     MSBs     LSBs        (result)
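
To make the behavior of these intrinsics concrete, here is a minimal host-side sketch that models them in portable C. The model_ names are hypothetical; on a ’C6x target the compiler supplies _mpy, _mpylh, and _add2 itself, and this sketch only mirrors the descriptions given above.

```c
#include <stdint.h>

/* Sign-extends the 16 LSBs of v to a 32-bit signed value. */
int32_t sext16(uint32_t v)
{
    v &= 0xffffu;
    return (v & 0x8000u) ? (int32_t)v - 0x10000 : (int32_t)v;
}

/* Models _mpy: 16 LSBs of a times 16 LSBs of b. */
int32_t model_mpy(uint32_t a, uint32_t b)
{
    return sext16(a) * sext16(b);
}

/* Models _mpylh: 16 LSBs of a times 16 MSBs of b. */
int32_t model_mpylh(uint32_t a, uint32_t b)
{
    return sext16(a) * sext16(b >> 16);
}

/* Models _add2: adds the upper halves and the lower halves separately.
   A carry out of the lower halves does not reach the upper halves. */
uint32_t model_add2(uint32_t a, uint32_t b)
{
    uint32_t lo = (a + b) & 0x0000ffffu;
    uint32_t hi = ((a >> 16) + (b >> 16)) & 0x0000ffffu;
    return (hi << 16) | lo;
}
```

For example, model_add2(0x0001FFFF, 0x00010001) returns 0x00020000: the lower halves wrap around to 0 and the upper halves add to 2, with no carry between them.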


Example 2-11 shows how to rewrite the vec_mpy( ) function to include the 
_mpy and _mpylh intrinsics: 


Example 2-11. The Revised Vector Multiply Function—vec_mpy2.c 


void vec_mpy2(int y[], const int x[], short scalar)
{
    int i, val;
    unsigned int temph, templ;

    for (i = 0; i < 75; i++)
    {
        val = x[i];
        templ = (_mpy(scalar, val) >> 15) & 0x0000ffff;
        temph = (_mpylh(scalar, val) << 1) & 0xffff0000;
        y[i] = _add2(y[i], temph | templ);
    }
}


Now, look at the iir1( ) function. Example 2-12 shows the same code that you
saw in Example 2-4.


Example 2-12. The Biquad Filter—iir1.c 


void iir1(const short *coefs, const short *input,
          short *optr, short *state)
{
    short x;
    short t;
    int n;

    x = input[0];

    for (n = 0; n < 50; n++)
    {
        t = x + ((coefs[2] * state[0] +
                  coefs[3] * state[1]) >> 15);
        x = t + ((coefs[0] * state[0] +
                  coefs[1] * state[1]) >> 15);
        state[1] = state[0];
        state[0] = t;
        coefs += 4;    /* point to next filter coefs  */
        state += 2;    /* point to next filter states */
    }

    *optr++ = x;
}
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You can improve the iir( ) function by using the same methods that you used 
to improve the vec_mpy( ) function. Example 2-13 shows how to rewrite the 
iir() function: 


Example 2-13. The Revised Biquad Filter—iir2.c 


void iir2(const int *coefs, const short *input,
          short *optr, short *state)
{
    short x;
    short t;
    int n;

    x = input[0];

    for (n = 0; n < 50; n++)
    {
        t = x + ((_mpy(coefs[1], state[0]) +
                  _mpyhl(coefs[1], state[1])) >> 15);
        x = t + ((_mpy(coefs[0], state[0]) +
                  _mpyhl(coefs[0], state[1])) >> 15);
        state[1] = state[0];
        state[0] = t;
        coefs += 2;
        state += 2;
    }

    *optr++ = x;
}
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Using demo2.c, shown in Example 2—14, run the revised functions through the 
compiler, assembler, and linker. 


Example 2-14. The Revised Example—demo2.c 


main(int argc, char *argv[])
{
    const short coefs[100];
    short optr[100];
    short state[2];
    const short a[100];
    const short b[100];
    int c = 0;
    int dotp[1] = {0};
    int sum = 0;
    short y[100];
    short scalar = 3345;
    const short x[100];

    sum = mac1(a, b, c, dotp);
    vec_mpy2(y, x, scalar);
    iir2(coefs, x, optr, state);
}


If you did not set up your C_OPTIONS environment variable as described
on page 2-6, enter the following on a command line:


cl6x -g -o -k -mg demo2.c mac1.c vec_mpy2.c iir2.c
-z lnk.cmd -l rts6201.lib -o demo2.out


OR


If you set up your C_OPTIONS environment variable as described on
page 2-6, enter the following on a command line:


cl6x -z -o demo2.out


Although the -z option is already specified in the C_OPTIONS environment
variable, you need to specify it on the command line to indicate that this
occurrence of -o is a linker option.


The inner loop of the vec_mpy2( ) function translates into the assembly output
shown in Example 2-15.


Example 2-15. Inner Loop Kernel of vec_mpy2.asm 


L3:        ; PIPED LOOP KERNEL

           OR      .L2X    B5,A8,B7
||         SHL     .S1     A6,1,A4
|| [ A1]   B       .S2     L3
||         AND     .L1     A5,A4,A6
||         LDW     .D2     *+B4(12),B5
||         MPYLH   .M1     A0,A9,A6
||         LDW     .D1     *A3++,A9

           STW     .D2     B6,*B4++
||         ADD2    .S2     B5,B7,B6
||         AND     .L1     A7,A4,A8
||         MV      .L2X    A6,B5
|| [ A1]   SUB     .D1     A1,1,A1
||         SHR     .S1     A8,15,A4
||         MPY     .M1     A0,A9,A8


As you can see, the code for the vec_mpy2( ) function is improved over the
original vec_mpy1( ) code. Two LDW instructions load four elements
(x[i], x[i+1], y[i], and y[i+1]), and one STW instruction stores two elements:
y[i] and y[i+1]. With the revised code, two y[i] results are stored every two
cycles. Recall that only one y[i] result was stored every two cycles in
Example 2-10.
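
One way to convince yourself that the word-based loop produces the same results as the halfword loop is to model both in portable C and compare them on a host machine. The sketch below is such a model, not the tutorial code itself: the function names are hypothetical, the intrinsics are replaced by plain C, the packing assumes little-endian order, and it assumes that >> on a negative int is an arithmetic shift and that narrowing to a 16-bit type wraps.

```c
#include <stdint.h>

/* Sign-extends the 16 LSBs of v to a 32-bit signed value. */
int32_t sext16(uint32_t v)
{
    v &= 0xffffu;
    return (v & 0x8000u) ? (int32_t)v - 0x10000 : (int32_t)v;
}

/* Packs two halfwords into a word, lo in the 16 LSBs. */
uint32_t pack2(int16_t lo, int16_t hi)
{
    return ((uint32_t)(uint16_t)hi << 16) | (uint16_t)lo;
}

/* Halfword model: one y[i] result per iteration. */
void vec_mpy_halfword(int16_t y[], const int16_t x[], int16_t scalar, int n)
{
    int i;

    for (i = 0; i < n; i++)
        y[i] = (int16_t)(y[i] + ((scalar * x[i]) >> 15));
}

/* Word model: two y[i] results per iteration. */
void vec_mpy_word(uint32_t y[], const uint32_t x[], int16_t scalar,
                  int n_words)
{
    int i;

    for (i = 0; i < n_words; i++) {
        uint32_t val = x[i];
        uint32_t plo = (uint32_t)(scalar * sext16(val));       /* like _mpy   */
        uint32_t phi = (uint32_t)(scalar * sext16(val >> 16)); /* like _mpylh */
        uint32_t templ = (plo >> 15) & 0x0000ffffu;
        uint32_t temph = (phi << 1) & 0xffff0000u;
        uint32_t sum = temph | templ;
        uint32_t lo = (y[i] + sum) & 0x0000ffffu;              /* like _add2  */
        uint32_t hi = ((y[i] >> 16) + (sum >> 16)) & 0x0000ffffu;

        y[i] = (hi << 16) | lo;
    }
}
```

To use the model, fill a halfword array, build the word arrays with pack2( ) from consecutive element pairs, run both functions with the same scalar, and compare each halfword of the word result with the corresponding y[i].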


Table 2-3 shows how the vec_mpy( ) function has improved as it moves from
phase 1 to phase 2.


Table 2-3. Revised Cycle Counts for vec_mpy( ) 


Function       Execute Packets   Loop Iterations   Constant   Cycle Count
vec_mpy1( )    2                 150               16         2 x 150 + 16 = 316
vec_mpy2( )    2                 75                22         2 x 75 + 22 = 172
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Now, look at the inner loop of the third function, iir( ). Example 2-16 shows the 
assembly output of the innermost loop for the revised iir( ) function: 


Example 2-16. Inner Loop Kernel of iir2.asm 


L3:        ; PIPED LOOP KERNEL

           ADD     .L2     B7,B8,B7        ;
||         ADD     .L1     A0,A3,A0        ;
||         MV      .S2     B6,B9           ;@
||         STH     .D1     A5,*+A4(6)      ;@
||         LDW     .D2     *B5++(8),B8     ;@@

           SHR     .S2     B7,15,B7        ;
||         EXT     .S1     A0,16,16,A0     ;
|| [ B0]   SUB     .L2     B0,1,B0         ;@
||         MPY     .M2X    B8,A5,B8        ;@
||         ADD     .L1X    B6,A3,A3        ;@
||         LDH     .D2     *+B4(14),B6     ;@@@

           ADD     .L1X    A0,B7,A6        ;
||         MPYHL   .M2     B8,B9,B7        ;@
||         SHR     .S1     A3,15,A3        ;@
|| [ B0]   B       .S2     L3              ;@
||         LDW     .D2     *+B5(4),B7      ;@@@
||         LDH     .D1     *+A4(12),A5     ;@@@

           ADD     .L2     4,B4,B4         ;
||         STH     .D1     A0,*A4++(4)     ;
||         EXT     .S1     A6,16,16,A0     ;
||         MPYHL   .M2     B7,B6,B6        ;@@
||         MPY     .M1X    B7,A5,A3        ;@@


Table 2-4 shows how the iir( ) function has improved. Now, the code has only
four execute packets; however, each packet has only five or six parallel
instructions, which could probably be improved.


Table 2-4. Revised Cycle Counts for iir( ) 


Function Execute Packets Loop Iterations Constant Cycle Count 
iir1() 5 50 20 5 x 50 + 20 = 270 
iir2() 4 50 20 4 x 50 + 20 = 220 
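
The Cycle Count column in these tables follows one formula: the number of execute packets in the loop kernel, times the number of loop iterations, plus a constant for the code outside the kernel. A minimal sketch of that arithmetic follows; the function name loop_cycle_count is hypothetical.

```c
/* Estimated cycle count of a software-pipelined loop, following the
   formula used in the cycle-count tables:
   execute packets x loop iterations + constant.
   loop_cycle_count is a hypothetical helper used only for illustration. */
int loop_cycle_count(int execute_packets, int iterations, int constant)
{
    return execute_packets * iterations + constant;
}
```

With the values from the tables, loop_cycle_count(5, 50, 20) gives 270 for iir1( ) and loop_cycle_count(4, 50, 20) gives 220 for iir2( ). Removing one execute packet from the kernel saves one cycle per iteration, which is why the kernel is the place to look for improvement.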
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Use the profiler to view the cycle counts of the revised functions. 


Your profile window should look like this: 


Profile

Area Name                Count   Inclusive   Incl-Max
C Function vec_mpy2()    1       172         172
C Function iir2()        1       220         220
C Function mac1()        1       167         167
C Function main()        1       637         637


Notice that the cycle count for the second function, the vector multiply, is down 
from 316 to 172. The IIR filter has improved also: it is down from 270 to 220. 
However, the cycle count for the IIR filter is still too high. Naturally, the cycle 
count for main( ) has decreased also. It is down from 831 to 637. 


Table 2-5. Revised Cycle Counts 


Function       Execute Packets   Loop Iterations   Constant   Cycle Count
mac1( )†       1                 150               17         1 x 150 + 17 = 167
vec_mpy2( )    2                 75                22         2 x 75 + 22 = 172
iir2( )        4                 50                20         4 x 50 + 20 = 220

† The cycle count for the mac1( ) function has not changed.


You have done everything you can to refine the C code in the iir( ) function. To
improve your code at this point, you need to use the assembly optimizer. This
leads you to phase 3 of the code development flow.
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2.7 Lesson 5: Phase 3 of the Code Development Flow 
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To further improve the iir( ) function, you will need to rewrite it in linear assem- 
bly. Linear assembly is the input for the assembly optimizer. 


Linear assembly is similar to regular ’C6x assembly code in that you use ’C6x 
instructions to write your code. With linear assembly, however, you do not need 
to specify all of the information that you need to specify in regular ’C6x assem- 
bly code. With linear assembly code, you have the option of specifying the in- 
formation or letting the assembly optimizer specify it for you. Here is the in- 
formation that you do not need to specify in linear assembly code: 


- Parallel instructions
- Pipeline latency
- Register usage
- Which functional unit is being used


If you choose not to specify these things, the assembly optimizer determines 
the information that you do not include, based on the information that it has 
about your code. As with other code generation tools, you might need to modify 
your linear assembly code until you are satisfied with its performance. When 
you do this, you will probably want to add more detail to your linear assembly. 
For example, you might want to specify which functional unit should be used. 


Before you use the assembly optimizer, you need to know the following things 
about how it works: 


- A linear assembly file must be specified with a .sa extension.

- Linear assembly code should include the .cproc and .endproc directives.
  The .cproc and .endproc directives delimit a section of your code that you
  want the assembly optimizer to optimize. Use .cproc at the beginning of
  the section and .endproc at the end of the section. In this way, you can set
  off sections of your assembly code that you want to be optimized, like pro-
  cedures or functions.

- Linear assembly code may include a .reg directive. The .reg directive al-
  lows you to use descriptive names for values that will be stored in regis-
  ters. When you use .reg, the assembly optimizer chooses a register whose
  use agrees with the functional units chosen for the instructions that oper-
  ate on the value.

- Linear assembly code may include a .trip directive. The .trip directive
  specifies the value of the trip count. The trip count indicates how many
  times a loop will iterate.


Now that you have some information about the fundamentals of linear assem- 
bly code, look at the revised C code for the biquad filter again. Example 2-17 
shows the same code that you saw in Example 2-13 on page 2-21. 
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Example 2-17. The Revised Biquad Filter—iir2.c 


void iir2(const int *coefs, const short *input,
          short *optr, short *state)
{
    short x;
    short t;
    int n;

    x = input[0];

    for (n = 0; n < 50; n++)
    {
        t = x + ((_mpy(coefs[1], state[0]) +
                  _mpyhl(coefs[1], state[1])) >> 15);
        x = t + ((_mpy(coefs[0], state[0]) +
                  _mpyhl(coefs[0], state[1])) >> 15);
        state[1] = state[0];
        state[0] = t;
        coefs += 2;
        state += 2;
    }

    *optr++ = x;
}


Example 2-18 shows how to rewrite the iir( ) function in linear assembly. 
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Example 2-18. The Biquad Filter, Revised and Assembly-Optimized—iir3.sa 


        .def    _iir3

_iir3:  .cproc  cptr0, sptr0
        .reg    cptr1, s01, s10, s23, c10, c32, s10_s, s10_t
        .reg    p0, p1, p2, p3, s23_s, s1, t, x, mask, sptr1, s10p, ctr

        MV      .2      cptr0,cptr1
        MV      .1      sptr0,sptr1
        MVK             50,ctr              ; setup loop counter

LOOP:   .trip 50

        LDW     .D1T1   *cptr0++[2],c32     ; CoefAddr[3] & CoefAddr[2]
        LDW     .D2T2   *cptr1++[2],c10     ; CoefAddr[1] & CoefAddr[0]
        LDW     .D1T2   *sptr0++,s10        ; StateAddr[1] & StateAddr[0]
        MV      .2      s10,s10p            ; save StateAddr[1] & StateAddr[0]
        MPY     .M1     c32,s10,p2          ; CoefAddr[2] * StateAddr[0]
        MPYH    .M1     c32,s10,p3          ; CoefAddr[3] * StateAddr[1]
        ADD     .1      p2,p3,s23           ; CA[2] * SA[0] + CA[3] * SA[1]
        SHR     .1      s23,15,s23_s        ; (CA[2] * SA[0] + CA[3] * SA[1]) >> 15
        ADD     .2      s23_s,x,t           ; t = x+((CA[2]*SA[0]+CA[3]*SA[1])>>15)
        AND     .2      t,mask,t            ; clear upper 16 bits
        MPY     .M2     c10,s10,p0          ; CoefAddr[0] * StateAddr[0]
        MPYH    .M2     c10,s10,p1          ; CoefAddr[1] * StateAddr[1]
        ADD     .2      p0,p1,s10_t         ; CA[0] * SA[0] + CA[1] * SA[1]
        SHR     .2      s10_t,15,s10_s      ; (CA[0] * SA[0] + CA[1] * SA[1]) >> 15
        ADD     .2      s10_s,t,x           ; x = t+((CA[0]*SA[0]+CA[1]*SA[1])>>15)
        SHL     .2      s10p,16,s1          ; StateAddr[1] = StateAddr[0]
        OR      .2      t,s1,s01            ; StateAddr[0] = t
        STW     .D1     s01,*sptr1++        ; store StateAddr[1] & StateAddr[0]

 [ctr]  ADD     .S1     -1,ctr,ctr          ; dec outer lp cntr
 [ctr]  B       .S1     LOOP                ; Branch outer loop

        .endproc
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Using demo3.c, shown in Example 2-19, run the revised functions through the 
code generation tools. 


Example 2-19. The Revised Example—demo3.c 


main(int argc, char *argv[])
{
    const short coefs[150];
    short optr[150];
    short state[2];
    const short a[150];
    const short b[150];
    int c = 0;
    int dotp[1] = {0};
    int sum = 0;
    short y[150];
    short scalar = 3345;
    const short x[150];

    sum = mac1(a, b, c, dotp);
    vec_mpy2(y, x, scalar);
    iir3(coefs, x, optr, state);
}


Use the shell program (cl6x) to compile, assemble, and link. Be sure you use
the -mg option. The -mg option ensures that the optimizations that are used
are compatible with profiling.


On a command line, enter:


cl6x -g -o -k -mg demo3.c mac1.c vec_mpy2.c iir3.sa
-z lnk.cmd -l rts6201.lib -o demo3.out


Notice that you used the shell program to compile a linear assembly file and
a C file at the same time. Also notice that (except for the -mg option) you used
the same options that you used in the first part of this tutorial. The assembly
optimizer has a small set of unique options, but many of the options that
you will use are shell options that apply to either linear assembly files or C files.
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Example 2-20. Inner Loop Kernel of iir3.asm 


L3:        ; PIPED LOOP KERNEL

           AND     .L2     B3,B7,B0           ;     Clear upper 16 bits
||         ADD     .S2     B0,B8,B8           ;@    CA[0] * SA[0] + CA[1] * SA[1]
|| [ A1]   B       .S1     L3                 ;@    Branch outer loop
||         ADD     .L1     A4,A5,A4           ;@    CA[2] * SA[0] + CA[3] * SA[1]
||         MPYH    .M2     B2,B1,B8           ;@@   CoefAddr[1] * StateAddr[1]
||         MPY     .M1X    A0,B1,A4           ;@@   CoefAddr[2] * StateAddr[0]
||         LDW     .D2     *B6++(8),B2        ;@@@@ CoefAddr[1] & CoefAddr[0]
||         LDW     .D1     *A3++(8),A0        ;@@@@ CoefAddr[3] & CoefAddr[2]

           ADD     .D2     B4,B0,B9           ;     x = t+((CA[0]*SA[0]+CA[1]*SA[1])>>15)
||         OR      .L2     B0,B9,B0           ;     StateAddr[0] = t
||         SHR     .S2     B8,0xf,B4          ;@    (CA[0] * SA[0] + CA[1] * SA[1]) >> 15
||         SHR     .S1     A4,0xf,A5          ;@    (CA[2] * SA[0] + CA[3] * SA[1]) >> 15
||         MPY     .M2     B2,B1,B0           ;@@   CoefAddr[0] * StateAddr[0]
||         MPYH    .M1X    A0,B1,A5           ;@@   CoefAddr[3] * StateAddr[1]
||         LDW     .D1     *A6++,B1           ;@@@@ StateAddr[1] & StateAddr[0]

           STW     .D1     B0,*A7++           ;     store StateAddr[1] & StateAddr[0]
||         SHL     .S2     B5,0x10,B9         ;@    StateAddr[1] = StateAddr[0]
||         ADD     .L2X    B9,A5,B3           ;@    t = x+((CA[2]*SA[0]+CA[3]*SA[1])>>15)
|| [ A1]   ADD     .S1     0xffffffff,A1,A1   ;@@   dec outer lp cntr
||         MV      .D2     B1,B5              ;@@   save StateAddr[1] & StateAddr[0]


Table 2-6 shows how the iir( ) function has improved as it has moved through
the three phases of code development.


Table 2-6. Revised Cycle Counts for iir( )


Function   Execute Packets   Loop Iterations   Constant   Cycle Count
iir1( )    5                 50                20         5 x 50 + 20 = 270
iir2( )    4                 50                20         4 x 50 + 20 = 220
iir3( )    3                 50                27         3 x 50 + 27 = 177
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Use the profiler to view the cycle counts of the revised functions. 


Your profile window should look like this: 


Profile

Area Name                Count   Inclusive   Incl-Max
C Function vec_mpy2()    1       172         172
C Function iir3()        1       177         177
C Function mac1()        1       167         167
C Function main()        1       594         594


Notice that the cycle count for the IIR filter has improved: it is down from 220
to 177. Naturally, the cycle count for main( ) has decreased also. It is down
from 637 to 594.


Table 2-7. Revised Cycle Counts 


Function        Execute Packets   Loop Iterations   Constant   Cycle Count
mac1( )†        1                 150               17         1 x 150 + 17 = 167
vec_mpy2( )†    2                 75                22         2 x 75 + 22 = 172
iir3( )         3                 50                27         3 x 50 + 27 = 177

† The cycle counts for the mac1( ) function and the vec_mpy( ) function have not changed.


The Using the Assembly Optimizer chapter in the TMS320C6x Optimizing C 
Compiler User’s Guide discusses the assembly optimizer in more detail. 


2.8 Summary 
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Congratulations! In this tutorial, you learned the following things: 


- What the three phases of code development are, how to determine which
  phases are appropriate for improving different parts of your code, and how
  to write your code for each phase.

- What a linear assembly file is and some fundamental information on how
  to write one.

- How to use the code generation tools to compile, assemble, and link your
  C and linear assembly files.

- How to use the profiler to analyze your results and determine whether or
  not you need to continue refining your code.


Chapter 3 


TMS320C6x Optimization Checklist 


Because most of the millions of instructions per second (MIPS) in DSP applica-
tions occur in tight loops, it is important for the ’C6x code generation tools to
make maximal use of all the hardware resources in important loops. Fortu-
nately, loops inherently have more parallelism than non-looping code because
there are multiple iterations of the same code executing with limited depen-
dencies between each iteration. Through a technique called software pipelin-
ing, the ’C6x code generation tools use the multiple resources of the VelociTI
architecture efficiently and obtain very high performance.


This chapter shows the code development flow recommended to achieve the 
highest performance on loops and provides a checklist that can be used to op- 
timize loops with references to more detailed documentation. 
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Table 3-1 describes the steps recommended for developing code to achieve 
the highest performance on loops. 


Table 3-1. Code Development Steps


Step   Description

1      Compile and profile native C code
       - Validates original C code
       - Determines which loops are most important in terms of MIPS
         requirements

2      Add const declarations and loop count information
       - Reduces potential pointer aliasing problems
       - Allows loops with indeterminate iteration counts to execute epilogs
       - Uses _nassert( ) intrinsic to pass loop count information to the
         compiler

3      Optimize C code using other ’C6x intrinsics and other methods
       - Facilitates use of certain ’C6x instructions not easily represented
         in C
       - Optimizes data flow bandwidth (uses word access for short (’C62x)
         data and double word access for word (’C67x) data)

4a     Write linear assembly
       - Allows control in determining exact ’C6x instructions to be used
       - Provides flexibility of hand-coded assembly without worry of
         pipelining, parallelism, or register allocation
       - Can pass memory bank information to the tools
       - Uses .trip directive to convey loop count information

4b     Add partitioning information to the linear assembly
       - Can improve partitioning of loops when necessary
       - Can avoid bottlenecks of certain hardware resources

       Code size considerations
       - Can trade small performance degradation for smaller code on loops
       - Can significantly reduce code size for control code

When you achieve the desired performance in your code, there is no need to
move to the next step. Each step in the development flow involves passing
more information to the ’C6x tools. Even at the final step, development time
is greatly reduced from that of hand-coding, and the performance approaches
the best that can be achieved by hand.
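
As an illustration of the second step (const declarations and loop count information), here is a minimal sketch of a loop annotated that way. The function dot_product and the minimum trip count of 8 are hypothetical. The fallback definition of _nassert, and the use of the _TMS320C6X macro to detect the target compiler, are assumptions made so that the sketch also builds on a host compiler; check your compiler documentation for the exact form your tools expect.

```c
/* On compilers that do not provide _nassert, define it away so the
   sketch still builds. Testing _TMS320C6X to detect the target
   compiler is an assumption. */
#ifndef _TMS320C6X
#define _nassert(cond) ((void)0)
#endif

/* Both inputs are read only, so they are declared const. This tells
   the compiler that stores in the loop cannot modify a[] or b[]. */
int dot_product(const short *a, const short *b, int n)
{
    int i;
    int sum = 0;

    /* Hypothetical promise to the compiler: the loop runs at least
       8 times. The caller must guarantee this. */
    _nassert(n >= 8);

    for (i = 0; i < n; i++)
        sum += a[i] * b[i];

    return sum;
}
```

The promise made through _nassert is only as good as the caller's guarantee. If the loop can run fewer times than stated, the generated code may be incorrect.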
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Internal benchmarking efforts at Texas Instruments have shown that most 
loops achieve maximal throughput after steps 1 and 2. For loops that do not, 
the C compiler offers a rich set of optimizations that let you fine-tune the
code entirely from the high-level C language. For the few loops that need even
further optimization, the assembly optimizer gives the programmer more
flexibility than C can offer, works within the framework of C, and is much like
programming in higher-level C. For more information on the assembly
optimizer, see the TMS320C6x Optimizing C Compiler User’s Guide and
Chapter 7, Optimizing Assembly Code via Linear Assembly, in this book. For
example linear assembly files, see the demo directory included with the ’C6x
tools.


To aid the development process, a feedback option (-mw) is included
in the code generation tools. Example 3-1 shows output from the compiler
and/or assembly optimizer for a particular loop. See the TMS320C6x Optimiz-
ing C Compiler User’s Guide for more information about the -mw option.


Example 3-1. Compiler and/or Assembly Optimizer Feedback


;*   SOFTWARE PIPELINE INFORMATION
;*
;*      Loop label : LOOP
;*      Known Minimum Trip Count
;*      Known Maximum Trip Count
;*      Known Max Trip Count Factor
;*      Loop Carried Dependency Bound(^)
;*      Unpartitioned Resource Bound
;*      Partitioned Resource Bound(*)
;*      Resource Partition:
;*                                A-side   B-side
;*      .L units
;*      .S units
;*      .D units
;*      .M units
;*      .X cross paths
;*      .T address paths
;*      Long read paths
;*      Long write paths
;*      Logical  ops (.LS)        (.L or .S unit)
;*      Addition ops (.LSD)       (.L or .S or .D unit)
;*      Bound(.L .S .LS)
;*      Bound(.L .S .D .LS .LSD)
;*
;*      Searching for software pipeline at ...
;*         ii = 5  Schedule found with 3 iterations in parallel
;*      Done
;*
;*      Speculative Loop Threshold : Unknown
;*      Collapsed Epilog Stages    : 2
;*      Prolog not removed         : Cannot speculate or predicate instruction
;*      Collapsed Prolog Stages    : 0

(The numeric values for the trip counts, bounds, and resource partition are not legible in this copy.)

This feedback is important in determining which optimizations might be useful 
for further improved performance. The following checklist is provided as a 
quick reference to techniques that can be used to optimize loops and refers 
to specific sections within this book for more detail. 


Table 3-2. TMS320C6x Optimization Checklist 


Feedback:   Loop carried dependency bound is much larger than unpartitioned
            resource bound
Solution:   C code
            - Use -pm program-level optimization to reduce memory pointer
              aliasing (refer to Performing Program-Level Optimization
              (-pm Option))
            - Add const declarations to all pointers passed to a function
              that are read only (refer to The const Keyword)
            - Use the -mt option to assume no memory pointer aliasing
              (refer to Memory Dependencies)
            Linear assembly
            - Use the .mdep and .no_mdep assembly optimizer directives
              (refer to Memory Alias Disambiguation and to Assembly
              Optimizer Options and Directives, page 7-4)

Feedback:   Partitioned resource bound is higher than unpartitioned
            resource bound
Solution:   - Write code in linear assembly with partitioning/functional
              unit information (refer to Linear Assembly Resource
              Allocation, page 7-24)

Feedback:   Too many instructions, or inefficient instructions
Solution:   - Use intrinsics in C code to select more efficient ’C6x
              instructions (refer to Using Intrinsics, page 4-13)
            - Write code in linear assembly to pick the exact ’C6x
              instruction to be executed (refer to Optimizing Assembly
              Code via Linear Assembly)

Feedback:   Failed to software pipeline due to register live-too-long
Solution:   - Use the -mx option for both C code and linear assembly
              (refer to Software Pipelining Retry, page 4-41)
            - Write linear assembly and insert MV instructions to split
              register lifetimes that are live-too-long (refer to
              Split-Join-Path Problems, page 7-105)

Feedback:   Failed to software pipeline due to register allocation
            (Cannot allocate machine registers)
Solution:   - Try splitting the loop into two separate loops (refer to
              Optimizing Assembly Code via Linear Assembly)
            Linear assembly
            - Repartition if too many instructions on one side
            - Use symbolic register names instead of machine registers
              (A0-A15, B0-B15)
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Table 3-2. TMS320C6x Optimization Checklist (Continued) 


Feedback:   Failed to software pipeline due to register allocation
            (Too many predicates live on one side)
Solution:   - Try splitting the loop into two separate loops
            - If multiple conditionals are used in the loop, allocation of
              these conditionals is the reason for the failure. Try writing
              linear assembly and partition all instructions, writing to
              condition registers evenly between the A and B sides of the
              machine. If there is an uneven number, put more on the B
              side, since there are 3 condition registers on the B side
              and only 2 on the A side.

Feedback:   T address paths are resource bound
Solution:   C code
            - Use word accesses for short arrays; declare int * (or use
              _nassert) and use mpy intrinsics to multiply upper and lower
              halves of registers
            - Try to employ redundant load elimination technique if
              possible
            Linear assembly
            - Use LDW/STW instructions for accesses to memory
Refer to:   Using Word Accesses for Short Data in Part II; Redundant Load
            Elimination; Using Word Access for Short Data in Part III

Feedback:   There are memory bank conflicts (specified in the memory
            analysis window of simulator)
Solution:   - Write linear assembly and use the .mptr directive
Refer to:   The .mptr Directive

Feedback:   Larger outer loop overhead in nested loop
Solution:   - Unroll the inner loop
            - Make one loop with the outer loop instructions conditional
              on an inner loop counter
Refer to:   Loop Unrolling; Loop Unrolling in Part II and Part III; Outer
            Loop Conditionally Executed With Inner Loop

Feedback:   Uneven resources (for example, 3 multiplies per iteration)
Solution:   - Unroll the loop to make an even number of resources
Refer to:   Loop Unrolling in Part III

Feedback:   Two loops are generated, one not software pipelined
Solution:   C code
            - Use -pm program-level optimization to gather more trip count
              information
            - Use the _nassert intrinsic to specify loop count information
            Linear assembly
            - Use the .trip directive to specify loop count information
Refer to:   Performing Program-Level Optimization (-pm Option), page 4-11;
            Communicating Trip-Count Information to the Compiler; The .trip
            Directive, page 7-8

Feedback:   Did not find schedule
            (Too many reads of one register)
            (Cycle count too high. Not profitable)
Solution:   - Split into multiple loops or reduce the complexity of the
              loop if possible
            Linear assembly
            - Unpartition/repartition the linear assembly source code
            - Probably best modified by another technique (i.e., loop
              unrolling)
            - Modify the register and/or partition constraints in linear
              assembly
Refer to:   Loop Unrolling in Part II and Part III

Feedback:   Address increment too large
Solution:   - Modify code so that the memory offsets are closer

Feedback:   Iterations in parallel > min. trip count
Solution:   - Use -pm program-level optimization to gather more trip count
              information
            - Add _nassert or .trip to provide more information on the
              minimum trip count
            - Make sure that the code size flag (-ms0) is not used in the
              compiler options
Refer to:   Performing Program-Level Optimization (-pm Option);
            Communicating Trip-Count Information to the Compiler; The .trip
            Directive, page 7-8

Feedback:   Iterations in parallel > max. trip count
Solution:   - Probably best optimized by another technique (i.e., unroll
              the loop completely)
Refer to:   Loop Unrolling in Part II and Part III
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Table 3-2. TMS320C6x Optimization Checklist (Continued) 


Feedback:   Trip var. used in loop - Can’t adjust trip count
Solution:   - Replicate the trip count variable and use the copy inside
              the loop so that the trip counter and the loop reference
              separate variables (refer to What Disqualifies a Loop From
              Being Software Pipelined)

Feedback:   Loop will not software pipeline for other reasons
Solution:   - Make sure there are no function calls, branches to other
              code, or conditional break or continue statements in the
              loop (refer to What Disqualifies a Loop from Being Software
              Pipelined, page 4-41)
            - Try making the loop counter down counting and declare it an
              int in C (refer to Tips on Data Types, page 4-21, and Trip
              Count Issues, page 4-33)
            - Refer to Section 4.3.3.7, What Disqualifies a Loop from
              Being Software-Pipelined, for a full list of potential
              reasons
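
Two of the C-level suggestions in the checklist (a down-counting int loop counter, and a trip counter that is kept separate from the variable referenced inside the loop) can be sketched as follows. The function name sum_down_counting and its arguments are hypothetical.

```c
/* Sums n elements of x. The trip counter i counts down, is declared
   int, and is used only for loop control. The loop body references a
   separate variable, idx, so the compiler is free to adjust the trip
   counter. sum_down_counting is a hypothetical example function. */
int sum_down_counting(const short *x, int n)
{
    int i;          /* trip counter: counts down to zero          */
    int idx = 0;    /* separate variable referenced in the body   */
    int sum = 0;

    for (i = n; i > 0; i--)
        sum += x[idx++];

    return sum;
}
```

The loop visits the elements in the same order as an up-counting loop would. Only the variable that controls the trip count changes.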
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Chapter 4 


Optimizing C Code 


You can maximize C performance by using compiler options, intrinsics, and 
code transformations. This chapter discusses the following topics: 


- The compiler and its options
- Intrinsics
- Software pipelining
- Loop unrolling
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4.1 Writing C Code 


This chapter shows you how to analyze and tailor your code to be sure you are 
getting the best performance from the ’C6x architecture. 


4.1.1 Tips on Data Types


Give careful consideration to the data type size when writing your code. The
’C6x compiler defines a size for each data type (signed and unsigned):

- char     8 bits
- short   16 bits
- int     32 bits
- long    40 bits
- float   32 bits
- double  64 bits

Based on the size of each data type, follow these guidelines when writing C
code:

- Avoid code that assumes that int and long types are the same size,
  because the ’C6x compiler uses long values for 40-bit operations.

- Use the short data type for fixed-point multiplication inputs whenever
  possible because this data type provides the most efficient use of the
  16-bit multiplier in the ’C6x (1 cycle for “short * short” versus 5 cycles for
  “int * int”).

- Use int or unsigned int types for loop counters, rather than short or
  unsigned short data types, to avoid unnecessary sign-extension instructions.

- When using floating-point instructions on a floating-point device such as
  the ’C67x, use the -mv6700 compiler switch so the code generated will
  use the device’s floating-point hardware instead of performing the task
  with fixed-point hardware. Without this switch, for example, the RTS
  floating-point multiply routine is called instead of the MPYSP
  instruction being used.
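The guidelines above can be sketched in a few lines of C. This is a minimal illustration, not code from the compiler documentation: the function name mac16 and its parameters are hypothetical. It keeps the multiply inputs as short values, accumulates in an int, and uses an int loop counter.

```c
/* Hypothetical helper (not from the manual): multiply-accumulate over
 * 16-bit inputs, written to follow the data type guidelines. */
int mac16(const short *x, const short *h, int n)
{
    int i;        /* int loop counter: no sign extension of a short counter */
    int acc = 0;  /* 32-bit accumulator */

    for (i = 0; i < n; i++)
        acc += x[i] * h[i];   /* short * short inputs to the multiply */

    return acc;
}
```

A call such as mac16(x, h, 3) with x = {1, 2, 3} and h = {4, 5, 6} returns 32. Whether the compiler actually selects the 16-bit multiply depends on the options used, so inspect the generated assembly to confirm.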


4.1.2 Analyzing C Code Performance

Use the following techniques to analyze the performance of specific code
regions:

- One of the preliminary measures of code is the time it takes the code to
  run. Use the clock( ) and printf( ) functions in C to time and display the
  performance of specific code regions. You can use the stand-alone
  simulator (load6x) to run the code for this purpose. Remember to subtract
  out the overhead of calling the clock( ) function.

- Use the profile mode of the stand-alone simulator. This can be done by
  compiling your code with the -mg option and executing load6x with the -g
  option. The profile results will be stored in a file with the .vaa extension.
  Refer to the TMS320C6x Optimizing C Compiler User’s Guide for more
  information.

- Enable the clock and use profile points and the RUN command in the Code
  Composer debugger to track the number of CPU clock cycles consumed
  by a particular section of code. Use “View Statistics” to view the number
  of cycles consumed.

- The critical performance areas in your code are most often loops. The
  easiest way to optimize a loop is by extracting it into a separate file that
  can be rewritten, recompiled, and run with the stand-alone simulator
  (load6x).

As you use the techniques described in this chapter to optimize your C code,
you can then evaluate the performance results by running the code and
looking at the instructions generated by the compiler.
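The first technique in the list above, timing a region with clock( ) and printf( ), might look like the following sketch. The functions work and time_region are hypothetical names introduced only for illustration; the only library calls used are the standard clock( ) and printf( ). The overhead of calling clock( ) is measured first and subtracted, as the text recommends.

```c
#include <stdio.h>
#include <time.h>

/* Hypothetical region of interest; stands in for the code being timed. */
static int work(int n)
{
    int i, s = 0;
    for (i = 0; i < n; i++)
        s += i * i;
    return s;
}

/* Times work(n) in clock() ticks, corrected for the cost of calling
 * clock() itself, and prints the corrected count. */
long time_region(int n, int *result)
{
    clock_t start, stop, overhead;

    start = clock();
    stop = clock();
    overhead = stop - start;      /* cost of one clock() call pair */

    start = clock();
    *result = work(n);            /* region being measured */
    stop = clock();

    printf("ticks: %ld\n", (long)(stop - start - overhead));
    return (long)(stop - start - overhead);
}
```

The resolution of clock( ) depends on the environment in which the code runs, so very short regions may report zero ticks; timing a larger number of iterations gives a more useful figure.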
4.2 Compiling C Code


The ’C6x compiler offers high-level language support by transforming your C
code into more efficient assembly language source code. The compiler tools
include a shell program (cl6x), which you use to compile, assembly optimize,
assemble, and link programs in a single step. To invoke the compiler shell,
enter:

cl6x [options] [filenames] [-z [linker options] [object files]]

For a complete description of the C compiler and the options discussed in this
chapter, see the TMS320C6x Optimizing C Compiler User’s Guide.


4.2.1 Compiler Options


Options control the operation of the compiler. Table 4-1 defines the options
discussed in this chapter.

Table 4-1. Subset of Compiler Options

Option   | Description
-o<n> †  | Enables software pipelining and other optimizations in the compiler
-pm ‡    | Enables program-level optimization
-mt      | Enables the compiler to use assumptions that allow it to be more aggressive with certain optimizations. When used on linear assembly files, it acts like a .no_mdep directive that has been defined for those linear assembly files.
-mg      | Allows you to profile optimized code
-ms<n>   | Allows you to reduce code size in loop code (-ms0) for a small performance degradation and reduce code size in control code (-ms2)
-k       | Keeps the assembly file so that you can inspect it
-mu      | Disables software pipelining (useful in helping to debug linear assembly source code)
-mh<n>   | Allows speculative execution
-mi<n>   | Describes the interrupt threshold to the compiler (See section 8.4.)
-ml<n>   | Describes how to reach data and code (near/far)
-mr<n>   | Describes how to call RTS routines (near/far)
-mx      | Enables software pipelined loop retry. This option tries multiple schedules on loops and selects the best schedule based on the trip count information known about the loop.

† Although -o3 is preferable, at a minimum use the -o option.
‡ Use the -pm option for as much of your program as possible.


4.2.2 Memory Dependencies


To maximize the efficiency of your code, the ’C6x compiler schedules as many
instructions as possible in parallel. To schedule instructions in parallel, the
compiler must determine the relationships, or dependencies, between
instructions. Dependency means that one instruction must occur before another;
for example, a variable must be loaded from memory before it can be used.
Because only independent instructions can execute in parallel, dependencies
inhibit parallelism.

- If the compiler cannot determine that two instructions are independent (for
  example, b does not depend on a), it assumes a dependency and schedules
  the two instructions sequentially, accounting for any latencies needed
  to complete the first instruction.

- If the compiler can determine that two instructions are independent of one
  another, it can schedule them in parallel.

Often it is difficult for the compiler to determine if instructions that access
memory are independent. The following techniques help the compiler
determine which instructions are independent:

- Use the const keyword to indicate which objects are not changed by a
  function.

- Use the -pm (program-level optimization) option, which gives the compiler
  global access to the whole program or module and allows it to be more
  aggressive in ruling out dependencies.

- Use the -mt option, which allows the compiler to use assumptions that
  allow it to eliminate dependencies. Remember, using the -mt option on
  linear assembly code is equivalent to adding the .no_mdep directive to the
  linear assembly source file. Specific memory dependencies should be
  specified with the .mdep directive. For more information see section 4.4,
  Assembly Optimizer Directives in the TMS320C6x Optimizing C Compiler
  User’s Guide.

To illustrate the concept of memory dependencies, it is helpful to look at the
algorithm code in a dependency graph. Example 4-1 shows the C code for a
basic vector sum. Figure 4-1 shows the dependency graph for this basic
vector sum. (For more information, see section 7.3.4, Drawing a Dependency
Graph, on page 7-11.)


Example 4-1. Basic Vector Sum

void vecsum(short *sum, short *in1, short *in2, unsigned int N)
{
    int i;

    for (i = 0; i < N; i++)
        sum[i] = in1[i] + in2[i];
}


Figure 4-1. Dependency Graph for Vector Sum #1

[Figure: two Load nodes feed an “Add elements” node, which feeds a “Store to
memory” node. The number on each path is the number of cycles required to
complete an instruction.]

The dependency graph in Figure 4-1 shows that:

- The paths from sum[i] back to in1[i] and in2[i] indicate that writing to sum
  may have an effect on the memory pointed to by either in1 or in2.

- A read from in1 or in2 cannot begin until the write to sum finishes, which
  creates an aliasing problem. Aliasing occurs when two pointers can point
  to the same memory location. For example, if vecsum( ) is called in a
  program with the following statements, in1 and sum alias each other because
  they both point to the same memory location:

  short a[10], b[10];
  vecsum(a, a, b, 10);
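To see why the compiler must be conservative, consider a call in which sum overlaps in1 with an offset of one element. The sketch below reuses the loop from Example 4-1; the driver overlap_demo and its data are hypothetical and are introduced only for illustration. Each iteration reads the value that the previous iteration wrote, so the loads cannot legally be moved ahead of the earlier stores.

```c
/* Same loop as Example 4-1. */
void vecsum(short *sum, short *in1, short *in2, unsigned int N)
{
    unsigned int i;

    for (i = 0; i < N; i++)
        sum[i] = in1[i] + in2[i];
}

/* Hypothetical driver: sum is in1 shifted by one element, so the loop
 * effectively computes a[i + 1] = a[i] + b[i], a running sum. */
void overlap_demo(short a[5])
{
    short b[4] = { 1, 1, 1, 1 };
    int i;

    for (i = 0; i < 5; i++)
        a[i] = 1;

    vecsum(a + 1, a, b, 4);
}
```

Executed in program order, the array ends up as 1, 2, 3, 4, 5. If all four loads from in1 were performed before any store to sum, the result would instead be 1, 2, 2, 2, 2, which is why the dependency paths in Figure 4-1 cannot be dropped without more information.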


4.2.2.1 The const Keyword

In Figure 4-1, the reads from in1 and in2 finish before the write to sum within
a single iteration. However, the ’C6x compiler uses software pipelining to
execute multiple iterations in parallel and, therefore, must determine memory
dependencies that exist across loop iterations.

To help the compiler, you can qualify an object with the const keyword, which
indicates that a variable or the memory referenced by a variable will not be
changed, but will remain constant. It is good coding practice to use the const
keyword wherever you can, because it is a simple way to increase the
performance and robustness of your code.

Example 4-2 shows the vecsum( ) example rewritten with the const keyword
to indicate that the write to sum never changes the memory referenced by in1
and in2. Figure 4-2 shows the revised dependency graph for the code in the
inner loop.


Example 4-2. Vector Sum With const Keywords

void vecsum2(short *sum, const short *in1, const short *in2, unsigned int N)
{
    int i;

    for (i = 0; i < N; i++)
        sum[i] = in1[i] + in2[i];
}

Figure 4-2. Dependency Graph for Vector Sum #2

[Figure: two Load nodes feed an “Add elements” node, which feeds a “Store to
memory” node.]


Example 4-3 shows the output of the compiler for the vector sum in
Example 4-2. The compiler finds better schedules when dependency paths
are eliminated between instructions. For this loop, the compiler found a
software pipeline with a 2-cycle kernel, compared with seven cycles for the
previous loop. (The kernel is the body of a pipelined loop where all instructions
execute in parallel.)

Example 4-3. Compiler Output for Vector Sum Code

L4:        ; PIPE LOOP KERNEL
           ADD    .L1X    B4,A0,A5
   [ B0]   B      .S2     L4
   [ A1]   LDH    .D1T1   *A3++,A0
   [ A2]   SUB    .S1     A2,1,A2
   [ A1]   SUB    .L1     A1,1,A1
   [!A2]   STH    .D1     A5,*A4++
   [ B0]   SUB    .L2     B0,1,B0
           LDH    .D2     *B5++,B4


For basic information on assembly code, see Chapter 4, Structure of Assem- 
bly Code. 


The compiler has collapsed the prolog and epilog code for the loop into the ker- 
nel as conditional code. That is why the LDH and STH instructions are execut- 
ed conditionally. For more information on understanding loop prologs, kernels, 
and epilogs, refer to Chapter 6. 


Caution

Do not use the const keyword if two pointers point to the same
object in memory and one of those pointers modifies memory.

Example 4-4. Incorrect Use of the const Keyword

void func(short *a, const short *b) /* Bad!! */
{
    int i;

    for (i = 11; i < 44; i++) *(--a) = *(--b);
}

void main()
{
    short array[] = {  1,  2,  3,  4,  5,  6,  7,  8,  9, 10,
                      11, 12, 13, 14, 15, 16, 17, 18,
                      19, 20, 21, 22, 23, 24, 25, 26,
                      27, 28, 29, 30, 31, 32, 33, 34,
                      35, 36, 37, 38, 39, 40, 41, 42,
                      43, 44 };
    short *ptr1, *ptr2;

    ptr2 = array + 44;
    ptr1 = ptr2 - 11;
    func(ptr2, ptr1); /* Bad!! */
}

Do not use the const keyword with code such as listed in Example 4-4. By
using the const keyword in Example 4-4, you are telling the compiler that it is
legal to write to any location pointed to by a before reading the location pointed
to by b. This is illegal because both a and b point to the same object: array.


4.2.2.2 Performing Program-Level Optimization (-pm Option)

You can specify program-level optimization by using the -pm option with the
-o3 option. With program-level optimization, all your source files are compiled
into one intermediate file called a module. The module moves to the
optimization and code generation passes of the compiler. Because the compiler
has access to the entire program, it performs several optimizations that are
rarely applied during file-level optimization:

- If a particular argument in a function always has the same value, the
  compiler replaces the argument with the value and passes the value instead
  of the argument.

- If a return value of a function is never used, the compiler deletes the return
  code in the function.

- If a function is not called, directly or indirectly, the compiler removes the
  function.

Also, using the -pm option can lead to better schedules for your loops. If the
number of iterations of a loop is determined by a value passed into the function,
and the compiler can determine what that value is from the caller, then the
compiler will have more information about the minimum trip count of the loop,
leading to a better resulting schedule.


4.2.2.3 The -mt Option

Another way to eliminate memory dependencies is to use the -mt option,
which allows the compiler to use assumptions that can eliminate memory
dependency paths. For example, if you use the -mt option when compiling the
code in Example 4-1, the compiler uses the assumption that in1 and in2
do not alias memory pointed to by sum and, therefore, eliminates memory
dependencies among the instructions that access those variables.

You would get the same loop kernel listed in Example 4-3. If your code does
not follow the assumptions generated by the -mt option, you can get incorrect
results. For more information on the -mt option refer to section 3.6.2 in the
TMS320C6x Optimizing C Compiler User’s Guide.


4.3 Refining C Code 


You can realize substantial gains in the performance of your C code by
refining your code in the following areas:

- Using intrinsics to replace complicated C code

- Using word access to operate on 16-bit data stored in the high and low
  parts of a 32-bit register

- Software pipelining the instructions manually

- Using double access to operate on 32-bit data stored in a 64-bit register
  pair (’C67x only)


4.3.1 Using Intrinsics


The ’C6x compiler provides intrinsics, special functions that map directly to 
inlined ’C62x/’C67x instructions, to optimize your C code quickly. All instruc- 
tions that are not easily expressed in C code are supported as intrinsics. Intrin- 
sics are specified with a leading underscore (_) and are accessed by calling 
them as you call a function. 


For example, saturated addition can be expressed in C code only by writing 
a multicycle function, such as the one in Example 4—5. 


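Example 4-5. Saturated Add Without Intrinsics

int sadd(int a, int b)
{
    int result;

    result = a + b;

    /* overflow is possible only when a and b have the same sign */
    if (((a ^ b) & 0x80000000) == 0)
    {
        /* overflow occurred if the sign of the result differs from a */
        if ((result ^ a) & 0x80000000)
        {
            result = (a < 0) ? 0x80000000 : 0x7fffffff;
        }
    }

    return (result);
}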


This complicated code can be replaced by the _sadd( ) intrinsic, which results
in a single ’C6x instruction (see Example 4-6).

Example 4-6. Saturated Add With Intrinsics

result = _sadd(a,b);
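When code that uses saturated addition must also be checked on a host machine where the intrinsic is not available, a portable reference can be useful. The sketch below is an assumption for host-side testing, not the _sadd( ) intrinsic itself, and the name sadd_ref is hypothetical. It widens to 64 bits before adding, a different technique from Example 4-5, which depends on a + b wrapping around (behavior that ISO C leaves undefined for signed int).

```c
#include <stdint.h>

/* Hypothetical host-side reference for 32-bit saturated addition.
 * The sum is formed in 64 bits so it cannot overflow, then clamped. */
int32_t sadd_ref(int32_t a, int32_t b)
{
    int64_t wide = (int64_t)a + (int64_t)b;

    if (wide > INT32_MAX)
        return INT32_MAX;     /* positive overflow saturates high */
    if (wide < INT32_MIN)
        return INT32_MIN;     /* negative overflow saturates low */

    return (int32_t)wide;
}
```

For example, sadd_ref(INT32_MAX, 1) returns INT32_MAX and sadd_ref(5, -7) returns -2.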


Table 4—2 lists the ’C6x intrinsics. For more information on using intrinsics, see 
the TMS320C6x Optimizing C Compiler User’s Guide. 


Table 4-2. TMS320C6x C Compiler Intrinsics

C Compiler Intrinsic | Assembly Instruction | Description | Device
int _abs(int src2); int _labs(long src2); | ABS | Returns the saturated absolute value of src2. |
int _add2(int src1, int src2); | ADD2 | Adds the upper and lower halves of src1 to the upper and lower halves of src2 and returns the result. Any overflow from the lower half add will not affect the upper half add. |
uint _clr(uint src2, uint csta, uint cstb); | CLR | Clears the specified field in src2. The beginning and ending bits of the field to be cleared are specified by csta and cstb, respectively. |
unsigned _clrr(uint src1, int src2); | CLR | Clears the specified field in src2. The beginning and ending bits of the field to be cleared are specified by the lower 10 bits of the source register. |
int _dpint(double); | DPINT | Converts 64-bit double to 32-bit signed integer, using the rounding mode set by the CSR register. | ’C67x
int _ext(uint src2, uint csta, int cstb); | EXT | Extracts the specified field in src2, sign-extended to 32 bits. The extract is performed by a shift left followed by a signed shift right; csta and cstb are the shift left and shift right amounts, respectively. |
int _extr(int src2, int src1); | EXT | Extracts the specified field in src2, sign-extended to 32 bits. The extract is performed by a shift left followed by a signed shift right; the shift left and shift right amounts are specified by the lower 10 bits of src1. |
uint _extu(uint src2, uint csta, uint cstb); | EXTU | Extracts the specified field in src2, zero-extended to 32 bits. The extract is performed by a shift left followed by an unsigned shift right; csta and cstb are the shift left and shift right amounts, respectively. |
uint _extur(uint src2, int src1); | EXTU | Extracts the specified field in src2, zero-extended to 32 bits. The extract is performed by a shift left followed by an unsigned shift right; the shift left and shift right amounts are specified by the lower 10 bits of src1. |
uint _ftoi(float); | | Reinterprets the bits in the float as an unsigned integer. (Ex: _ftoi(1.0) == 1065353216U) | ’C67x
uint _hi(double); | | Returns the high 32 bits of a double as an integer. | ’C67x
double _itod(uint, uint); | | Creates a new double register pair from two unsigned integers. | ’C67x
float _itof(uint); | | Reinterprets the bits in the unsigned integer as a float. (Ex: _itof(0x3f800000) == 1.0) | ’C67x
uint _lmbd(uint src1, uint src2); | LMBD | Searches for a leftmost 1 or 0 of src2 determined by the LSB of src1. Returns the number of bits up to the bit change. |
uint _lo(double); | | Returns the low (even) register of a double register pair as an integer. | ’C67x
int _mpy(int src1, int src2); int _mpyus(uint src1, int src2); int _mpysu(int src1, uint src2); uint _mpyu(uint src1, uint src2); | MPY, MPYUS, MPYSU, MPYU | Multiplies the 16 LSBs of src1 by the 16 LSBs of src2 and returns the result. Values can be signed or unsigned. |
int _mpyh(int src1, int src2); int _mpyhus(uint src1, int src2); int _mpyhsu(int src1, uint src2); uint _mpyhu(uint src1, uint src2); | MPYH, MPYHUS, MPYHSU, MPYHU | Multiplies the 16 MSBs of src1 by the 16 MSBs of src2 and returns the result. Values can be signed or unsigned. |
int _mpyhl(int src1, int src2); int _mpyhuls(uint src1, int src2); int _mpyhslu(int src1, uint src2); uint _mpyhlu(uint src1, uint src2); | MPYHL, MPYHULS, MPYHSLU, MPYHLU | Multiplies the 16 MSBs of src1 by the 16 LSBs of src2 and returns the result. Values can be signed or unsigned. |
int _mpylh(int src1, int src2); int _mpyluhs(uint src1, int src2); int _mpylshu(int src1, uint src2); uint _mpylhu(uint src1, uint src2); | MPYLH, MPYLUHS, MPYLSHU, MPYLHU | Multiplies the 16 LSBs of src1 by the 16 MSBs of src2 and returns the result. Values can be signed or unsigned. |
void _nassert(int); | | Generates no code. Tells the optimizer that the expression declared with the assert function is true. This gives a hint to the compiler as to what optimizations might be valid (trip count information for software pipelined loops and about using word-wide optimizations). |
uint _norm(int src2); uint _lnorm(long src2); | NORM | Returns the number of bits up to the first nonredundant sign bit of src2. |
double _rcpdp(double); | RCPDP | Computes the approximate 64-bit double reciprocal. | ’C67x
float _rcpsp(float); | RCPSP | Computes the approximate 32-bit float reciprocal. | ’C67x
double _rsqrdp(double src); | RSQRDP | Computes the approximate 64-bit double reciprocal square root. | ’C67x
float _rsqrsp(float src); | RSQRSP | Computes the approximate 32-bit float reciprocal square root. | ’C67x
int _sadd(int src1, int src2); long _lsadd(int src1, long src2); | SADD | Adds src1 to src2 and saturates the result. Returns the result. |
int _sat(long src2); | SAT | Converts a 40-bit value to a 32-bit value and saturates if necessary. |
uint _set(uint src2, uint csta, uint cstb); | SET | Sets the specified field in src2 to all 1s and returns the src2 value. The beginning and ending bits of the field to be set are specified by csta and cstb, respectively. |
unsigned _setr(unsigned, int); | SET | Sets the specified field in src2 to all 1s and returns the src2 value. The beginning and ending bits of the field to be set are specified by the lower ten bits of the source register. |
int _smpy(int src1, int src2); int _smpyh(int src1, int src2); int _smpyhl(int src1, int src2); int _smpylh(int src1, int src2); | SMPY, SMPYH, SMPYHL, SMPYLH | Multiplies src1 by src2, left shifts the result by one, and returns the result. If the result is 0x80000000, saturates the result to 0x7FFFFFFF. |
int _spint(float); | SPINT | Converts 32-bit float to 32-bit signed integer, using the rounding mode set by the CSR register. | ’C67x
uint _sshl(uint src2, uint src1); | SSHL | Shifts src2 left by the contents of src1, saturates the result to 32 bits, and returns the result. |
int _ssub(int src1, int src2); long _lssub(int src1, long src2); | SSUB | Subtracts src2 from src1, saturates the result size, and returns the result. |
uint _subc(uint src1, uint src2); | SUBC | Conditional subtract divide step. |
int _sub2(int src1, int src2); | SUB2 | Subtracts the upper and lower halves of src2 from the upper and lower halves of src1, and returns the result. Any borrowing from the lower half subtract does not affect the upper half subtract. |

Note: Instructions not specified with a device apply to all ’C6x devices.
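The bit reinterpretation that Table 4-2 describes for _ftoi( ) and _itof( ) can be modeled on a host with memcpy( ). The sketch below is not the intrinsics themselves: the names ftoi_ref and itof_ref are hypothetical, and the code assumes that float is a 32-bit IEEE 754 type on the host.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical host stand-in for _ftoi(): returns the bit pattern of a
 * float as an unsigned integer. Assumes a 32-bit IEEE 754 float. */
uint32_t ftoi_ref(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);   /* copy the bits, no numeric conversion */
    return u;
}

/* Hypothetical host stand-in for _itof(): interprets an unsigned
 * integer bit pattern as a float. */
float itof_ref(uint32_t u)
{
    float f;
    memcpy(&f, &u, sizeof f);
    return f;
}
```

With these definitions, ftoi_ref(1.0f) yields 1065353216U and itof_ref(0x3f800000) yields 1.0, matching the examples given in the table.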
4.3.2 Using Word Access for Short Data

The ’C6x has instructions with corresponding intrinsics, such as _add2( ),
_mpyhl( ), and _mpylh( ), that operate on 16-bit data stored in the high and low
parts of a 32-bit register. When operating on a stream of short data, you can
use word (int) accesses to read two short values at a time, and then use ’C6x
intrinsics to operate on the data. For example, rewriting the vecsum( ) function
to use word accesses (as in Example 4-7) doubles the performance of the
loop. See section 7.4, Loading Two Data Values with LDW, on page 7-20 for
more information. This type of optimization is called SIMD (Single Instruction
Multiple Data).
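The behavior of a packed 16-bit add can be modeled in portable C, which helps when reasoning about what a word-wide operation does to two short values at once. The function add2_ref below is a hypothetical host model written from the description of _add2( ) in Table 4-2; it is not the intrinsic. The two halves are added independently, so a carry out of the lower half is discarded rather than propagated.

```c
#include <stdint.h>

/* Hypothetical host model of a packed 16-bit add: the lower halves and
 * the upper halves of the two words are added independently. */
uint32_t add2_ref(uint32_t src1, uint32_t src2)
{
    /* lower halves: keep 16 bits, discard any carry */
    uint32_t lo = (src1 + src2) & 0x0000FFFFu;

    /* upper halves: add, then move back into bits 31..16
     * (bits shifted beyond bit 31 are dropped) */
    uint32_t hi = ((src1 >> 16) + (src2 >> 16)) << 16;

    return hi | lo;
}
```

For example, add2_ref(0x0001FFFF, 0x00010001) gives 0x00020000: the lower half wraps from 0xFFFF to 0x0000 and the upper half is 1 + 1 = 2, unaffected by the lower-half overflow.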


Example 4-7. Vector Sum With const Keywords, _nassert, Word Reads

void vecsum4(short *sum, const short *in1, const short *in2, unsigned int N)
{
    int i;
    const int *i_in1 = (const int *)in1;
    const int *i_in2 = (const int *)in2;
    int *i_sum = (int *)sum;

    _nassert(N >= 20);

    for (i = 0; i < (N/2); i++)
        i_sum[i] = _add2(i_in1[i], i_in2[i]);
}

Note:

The _nassert intrinsic tells the optimizer that the code that follows meets the
condition specified.


This transformation assumes that the pointers sum, in1, and in2 can be cast 
to int *, which means that they must point to word-aligned data. By default, the 
compiler aligns all short arrays on word boundaries; however, a call like the 
following creates an illegal memory access: 


short a[51], b[50], c[50]; vecsum4(&a[1], b, c, 50); 


Another consideration is that the loop must now run for an even number of
iterations. You can ensure that this happens by padding the short arrays so
that the loop always operates on an even number of elements.

If a vecsum( ) function is needed to handle short-aligned data and
odd-numbered loop counters, then you must add code within the function to
check for these cases. Knowing what type of data is passed to a function can
improve performance considerably. It may be useful to write different functions
that can handle different types of data. If your short-data operations always
operate on even-numbered word-aligned arrays, then the performance of your
application can be improved. However, Example 4-8 provides a generic
vecsum( ) function that handles all types of alignments and array sizes.

Example 4-8. Vector Sum With const Keywords, _nassert, Word Reads (Generic Version)

void vecsum5(short *sum, const short *in1, const short *in2, unsigned int N)
{
    int i;

    _nassert(N >= 20);

    /* test to see if sum, in2, and in1 are aligned to a word boundary */
    if (((int)sum | (int)in2 | (int)in1) & 0x2)
    {
        for (i = 0; i < N; i++)
            sum[i] = in1[i] + in2[i];
    }
    else
    {
        const int *i_in1 = (const int *)in1;
        const int *i_in2 = (const int *)in2;
        int *i_sum = (int *)sum;

        for (i = 0; i < (N/2); i++)
            i_sum[i] = _add2(i_in1[i], i_in2[i]);

        /* odd N: the last short element, at index N - 1, is not covered */
        /* by the word loop above                                         */
        if (N & 0x1) sum[N - 1] = in1[N - 1] + in2[N - 1];
    }
}


4.3.2.1 Using Word Access in Dot Product

Other intrinsics that are useful for reading short data as words are the multiply
intrinsics. Example 4-9 is a dot product example that reads word-aligned short
data and uses the _mpy( ) and _mpyh( ) intrinsics. The _mpyh( ) intrinsic uses
the ’C6x instruction MPYH, which multiplies the high 16 bits of two registers,
giving a 32-bit result.

This example also uses two sum variables (sum1 and sum2). Using only one
sum variable inhibits parallelism by creating a dependency between the write
from the first sum calculation and the read in the second sum calculation.
Within a small loop body, avoid writing to the same variable, because it inhibits
parallelism and creates dependencies.

Example 4-9. Dot Product Using Intrinsics

int dotprod(const short *a, const short *b, unsigned int N)
{
    int i, sum1 = 0, sum2 = 0;
    const int *i_a = (const int *)a;
    const int *i_b = (const int *)b;

    for (i = 0; i < (N >> 1); i++)
    {
        sum1 = sum1 + _mpy (i_a[i], i_b[i]);
        sum2 = sum2 + _mpyh(i_a[i], i_b[i]);
    }

    return sum1 + sum2;
}
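The structure of the word-access dot product can be checked on a host with portable models of the two multiplies. Everything in the sketch below is a hypothetical stand-in: mpy_ref and mpyh_ref model the signed 16 x 16 multiplies described for _mpy( ) and _mpyh( ) in Table 4-2, and pack2 builds each word explicitly so that the sketch does not depend on host alignment or byte order. It assumes N is even and that the host converts out-of-range values to int16_t by wrapping, which is implementation-defined in ISO C but common practice.

```c
#include <stdint.h>

/* Hypothetical model of a signed multiply of the two low halves. */
static int32_t mpy_ref(uint32_t a, uint32_t b)
{
    return (int32_t)(int16_t)(a & 0xFFFFu) * (int32_t)(int16_t)(b & 0xFFFFu);
}

/* Hypothetical model of a signed multiply of the two high halves. */
static int32_t mpyh_ref(uint32_t a, uint32_t b)
{
    return (int32_t)(int16_t)(a >> 16) * (int32_t)(int16_t)(b >> 16);
}

/* Packs two shorts into one word: lo in bits 15..0, hi in bits 31..16. */
static uint32_t pack2(short lo, short hi)
{
    return (uint32_t)(uint16_t)lo | ((uint32_t)(uint16_t)hi << 16);
}

/* Dot product with two independent accumulators; N must be even. */
int dotprod_ref(const short *a, const short *b, unsigned int N)
{
    unsigned int i;
    int32_t sum1 = 0, sum2 = 0;

    for (i = 0; i < (N >> 1); i++)
    {
        uint32_t wa = pack2(a[2 * i], a[2 * i + 1]);
        uint32_t wb = pack2(b[2 * i], b[2 * i + 1]);

        sum1 += mpy_ref(wa, wb);    /* even-indexed products */
        sum2 += mpyh_ref(wa, wb);   /* odd-indexed products  */
    }

    return sum1 + sum2;
}
```

Because sum1 and sum2 never read each other inside the loop, the two accumulations are independent, which is the property the text above relies on.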


4.3.2.2 Using Word Access in FIR Filter 


Example 4—10 shows an FIR filter that can be optimized with word reads of 
short data and multiply intrinsics. 


Example 4-10. FIR Filter - Original Form

void fir1(const short x[], const short h[], short y[], int n, int m, int s)
{
    int i, j;
    long y0;
    long round = 1L << (s - 1);

    for (j = 0; j < m; j++)
    {
        y0 = round;

        for (i = 0; i < n; i++)
            y0 += x[i + j] * h[i];

        y[j] = (int)(y0 >> s);
    }
}


Example 4-11 shows an optimized version of Example 4-10. The optimized
version passes an int array instead of casting the short arrays to int arrays and,
therefore, helps ensure that data passed to the function is word-aligned.
Assuming that a prototype is used, each invocation of the function ensures that
the input arrays are word-aligned by forcing you to insert a cast or by using int
arrays that contain short data.

Example 4-11. FIR Filter - Optimized Form

void fir2(const int x[], const int h[], short y[], int n, int m, int s)
{
    int i, j;
    long y0, y1;
    long round = 1L << (s - 1);

    _nassert(m >= 16);
    _nassert(n >= 16);

    for (j = 0; j < (m >> 1); j++)
    {
        y0 = y1 = round;

        for (i = 0; i < (n >> 1); i++)
        {
            y0 += _mpy  (x[i + j], h[i]);
            y0 += _mpyh (x[i + j], h[i]);
            y1 += _mpyhl(x[i + j], h[i]);
            y1 += _mpylh(x[i + j + 1], h[i]);
        }

        *y++ = (int)(y0 >> s);
        *y++ = (int)(y1 >> s);
    }
}

short x[SIZE_X], h[SIZE_H], y[SIZE_Y];

void main()
{
    /* the casts are required because fir2( ) takes int arrays */
    fir2((int *)x, (int *)h, y, n, m, s);
}


4.3.2.3 Using Double Word Access for Word Data (’C67x Specific)

The ’C67x architecture has a load double word (LDDW) instruction, which can
read 64 bits of data into a register pair. Just like using word accesses to read
2 short data items, double word accesses can be used to read 2 word data
items (or 4 short data items). When operating on a stream of float data, you
can use double accesses to read 2 float values at a time, and then use
intrinsics to operate on the data.

The basic float dot product is shown in Example 4-12. Since the float addition
(ADDSP) instruction takes 4 cycles to complete, the minimum kernel size for
this loop is 4 cycles. For this version of the loop, a result is completed every
4 cycles.

Example 4-12. Basic Float Dot Product

float dotp1(const float a[], const float b[])
{
    int i;
    float sum = 0;

    for (i = 0; i < 512; i++)
        sum += a[i] * b[i];

    return sum;
}
In Example 4-13, the dot product example is rewritten to use double word
loads, and intrinsics are used to extract the high and low 32-bit values
contained in the 64-bit double. The _hi( ) and _lo( ) intrinsics return integer
values; the _itof( ) intrinsic subverts the C typing system by interpreting an
integer value as a float value. In this version of the loop, 2 float results are
computed every 4 cycles. Recall that arrays can be aligned on double word
boundaries by using either the DATA_ALIGN (for globally defined arrays) or
DATA_MEM_BANK (for locally defined arrays) pragmas. Example 4-13 and
Example 4-14 show these pragmas.

Example 4-13. Float Dot Product Using Intrinsics

float dotp2(const double a[], const double b[])
{
    int i;
    float sum0 = 0;
    float sum1 = 0;

    for (i = 0; i < 512/2; i++)
    {
        sum0 += _itof(_hi(a[i])) * _itof(_hi(b[i]));
        sum1 += _itof(_lo(a[i])) * _itof(_lo(b[i]));
    }

    return sum0 + sum1;
}

#pragma DATA_ALIGN(a, 8);
#pragma DATA_ALIGN(b, 8);
float ret_val, a[SIZE_A], b[SIZE_B];

void main()
{
    ret_val = dotp2((double *)a, (double *)b);
}

In Example 4-14, the dot product example is unrolled to maximize
performance. The preprocessor is used to define convenient macros FHI() and
FLO() for accessing the high and low 32-bit values in a double word. In this
version of the loop, 8 float values are computed every 4 cycles.

Example 4-14. Float Dot Product With Peak Performance

#define FHI(a) _itof(_hi(a))
#define FLO(a) _itof(_lo(a))

float dotp3(const double a[], const double b[])
{
    int i;
    float sum0 = 0;
    float sum1 = 0;
    float sum2 = 0;
    float sum3 = 0;
    float sum4 = 0;
    float sum5 = 0;
    float sum6 = 0;
    float sum7 = 0;

    /* 512/2 double words hold the same 512 floats as in Example 4-13 */
    for (i = 0; i < 512/2; i += 4)
    {
        sum0 += FHI(a[i])   * FHI(b[i]);
        sum1 += FLO(a[i])   * FLO(b[i]);
        sum2 += FHI(a[i+1]) * FHI(b[i+1]);
        sum3 += FLO(a[i+1]) * FLO(b[i+1]);
        sum4 += FHI(a[i+2]) * FHI(b[i+2]);
        sum5 += FLO(a[i+2]) * FLO(b[i+2]);
        sum6 += FHI(a[i+3]) * FHI(b[i+3]);
        sum7 += FLO(a[i+3]) * FLO(b[i+3]);
    }

    sum0 += sum1;
    sum2 += sum3;
    sum4 += sum5;
    sum6 += sum7;
    sum0 += sum2;
    sum4 += sum6;

    return sum0 + sum4;
}

void main()
{
    /* Using 0 as the bank parameter for the DATA_MEM_BANK */
    /* pragma aligns variable to a double word boundary for */
    /* both C62xx and C67xx. */
    #pragma DATA_MEM_BANK(a, 0);
    #pragma DATA_MEM_BANK(b, 0);
    float ret_val, a[SIZE_A], b[SIZE_B];

    ret_val = dotp3((double *)a, (double *)b);
}
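The idea behind the unrolled version, several partial sums that do not depend on each other, can be tried in portable C without the ’C67x intrinsics. The sketch below is an illustration under stated assumptions, not the manual's code: the name dotp_acc4 is hypothetical, it uses four accumulators rather than eight, and it assumes that n is a multiple of 4.

```c
/* Hypothetical portable sketch of a float dot product with four
 * independent partial sums. n must be a multiple of 4. */
float dotp_acc4(const float *a, const float *b, int n)
{
    float s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    int i;

    for (i = 0; i < n; i += 4)
    {
        /* no statement reads a sum written by another statement, so
         * the four multiply-adds can overlap in a pipelined schedule */
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }

    /* combine the partial sums after the loop */
    return (s0 + s1) + (s2 + s3);
}
```

Floating-point addition is not associative, so splitting the sum this way can change the low-order bits of the result compared with the single accumulator in Example 4-12. For data whose products and sums are exactly representable, the results agree.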
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4.3.2.4 Using _nassert() and Word Accesses 


It is possible for the compiler to automatically perform SIMD optimizations for 
some, but notall loops. By either using global arrays, or by using the __nassert() 
intrinsic to provide alignment information about your pointers, the compiler can 
transform your code to use word accesses and the ‘C6x intrinsics. 


Example 4-15 shows how the compiler can automatically do this optimiza- 
tions. 


Example 4-15. Using the Compiler to Generate a Dot Product With Word Accesses


int dotprod1(const short *a, const short *b, unsigned int N)
{
    int i, sum = 0;

    /* a and b are aligned to a word boundary */
    _nassert(((int)(a) & 0x3) == 0);
    _nassert(((int)(b) & 0x3) == 0);

    _nassert(N == 40);

    for (i = 0; i < N; i++)
        sum += a[i] * b[i];
    return sum;
}


Compile Example 4-15 with the following options: -o -mw -k. Open the assembly
file and look at the loop kernel. The results are identical to those produced
by Example 4-9. The first two _nassert() intrinsics in Example 4-15 tell the
compiler that the arrays pointed to by a and b are aligned to a word boundary,
so it is safe for the compiler to use an LDW instruction to load two short
values. The compiler generates the _mpy() and _mpyh() intrinsics internally,
as well as the two sums that were used in Example 4-9 (shown again below).

int dotprod(const short *a, const short *b,
            unsigned int N)
{
    int i, sum1 = 0, sum2 = 0;
    const int *i_a = (const int *)a;
    const int *i_b = (const int *)b;

    for (i = 0; i < (N >> 1); i++) {
        sum1 = sum1 + _mpy(i_a[i], i_b[i]);
        sum2 = sum2 + _mpyh(i_a[i], i_b[i]);
    }

    return sum1 + sum2;
}


You also need some way to convey to the compiler that this loop will execute
an even number of times. The third intrinsic, _nassert(N == 40), conveys this
information by telling the compiler that the loop will execute exactly 40
times, which is an even number. (For more information on the _nassert
intrinsic, refer to section 4.3.3.3, Communicating Trip-Count Information to
the Compiler.)
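

The equivalence that this transformation relies on can be checked on a host
machine. The following is a minimal sketch, not compiler output: mpy_lo16() and
mpy_hi16() are hypothetical stand-ins for the _mpy() and _mpyh() intrinsics,
plain assert() stands in for _nassert(), and 16-bit shorts packed two per
32-bit word are assumed.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical host stand-ins for _mpy() and _mpyh(): signed 16x16
 * multiplies of the low and high halves of two 32-bit words. */
static int mpy_lo16(uint32_t x, uint32_t y)
{
    return (int16_t)(uint16_t)(x & 0xFFFFu) * (int16_t)(uint16_t)(y & 0xFFFFu);
}

static int mpy_hi16(uint32_t x, uint32_t y)
{
    return (int16_t)(uint16_t)(x >> 16) * (int16_t)(uint16_t)(y >> 16);
}

/* Scalar reference: one halfword pair per iteration. */
int dotprod_ref(const int16_t *a, const int16_t *b, unsigned int n)
{
    int sum = 0;
    unsigned int i;
    for (i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}

/* Word-access form: one word (two halfwords) per array per iteration. */
int dotprod_words(const int16_t *a, const int16_t *b, unsigned int n)
{
    int sum1 = 0, sum2 = 0;
    unsigned int i;

    assert(((uintptr_t)a & 0x3) == 0);   /* word-aligned input      */
    assert(((uintptr_t)b & 0x3) == 0);   /* word-aligned input      */
    assert((n % 2) == 0);                /* even number of elements */

    for (i = 0; i < n / 2; i++) {
        uint32_t wa, wb;
        memcpy(&wa, a + 2 * i, sizeof wa);   /* models one LDW */
        memcpy(&wb, b + 2 * i, sizeof wb);   /* models one LDW */
        sum1 += mpy_lo16(wa, wb);
        sum2 += mpy_hi16(wa, wb);
    }
    return sum1 + sum2;
}
```

Both functions return the same value for any word-aligned, even-length input,
regardless of byte order, because both arrays are packed the same way. The
alignment and even-count checks are exactly the facts that the three
_nassert() calls in Example 4-15 supply.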


Example 4—16 and Example 4—17 show how to use the _nassert() intrinsic to 
get word accesses on the vector sum and the FIR filter. 
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Example 4—16. Using the _nassert() Intrinsic to Generate Word Accesses for Vector Sum 


void vecsum(short *sum, const short *in1,
            const short *in2, unsigned int N)
{
    int i;

    _nassert(((int)sum & 0x3) == 0);
    _nassert(((int)in1 & 0x3) == 0);
    _nassert(((int)in2 & 0x3) == 0);

    for (i = 0; i < N; i++)
        sum[i] = in1[i] + in2[i];
}


Example 4-17. Using _nassert() Intrinsic to Generate Word Accesses for FIR Filter 


void fir(const short x[], const short h[], short y[],
         int n, int m, int s)
{
    int i, j;
    long y0;
    long round = 1L << (s - 1);

    _nassert(((int)x & 0x3) == 0);
    _nassert(((int)h & 0x3) == 0);
    _nassert(((int)y & 0x3) == 0);
    _nassert(n == 40);

    for (j = 0; j < m; j++)
    {
        y0 = round;
        for (i = 0; i < n; i++)
            y0 += x[i + j] * h[i];
        y[j] = (int)(y0 >> s);
    }
}

As the following listings show, the code that the compiler produces for
Example 4-17 is not as efficient as the code produced for Example 4-11, but it
is more efficient than the code produced for Example 4-10.


<compiler output from Example 4-17> 


L3:     ; PIPED LOOP KERNEL

 [!B0]  ADD     .L1     A9,A7:A6,A7:A6
        MPY     .M2X    A8,B9,B3
        MPYHL   .M1X    B9,A0,A0

 [ A1]  B       .S2     L3
        LDH     .D2T2   *++B2(8),B9
        LDH     .D1T1   *+A3(4),A8

 [!B0]  ADD     .L2     B9,B5:B4,B5:B4
        MPY     .M1X    A0,B1,A9
        LDW     .D2T2   *+B8(4),B9
        LDH     .D1T1   *+A3(6),A0

 [ B0]  SUB     .S2     B0,1,B0

 [!B0]  ADD     .L2     B3,B7:B6,B7:B6

 [!B0]  ADD     .L1     A0,A5:A4,A5:A4
        MPYHL   .M2     B1,B9,B9
|| [ A1] SUB    .S1     A1,1,A1
||      LDW     .D2T2   *++B8(8),B1
||      LDH     .D1T1   *++A3(8),A0


<compiler output from Example 4-11> 


L3:     ; PIPED LOOP KERNEL

        ADD     .L2     B3,B5:B4,B5:B4
        ADD     .L1     A3,A5:A4,A5:A4
        MV      .S2     B1,B2
        MPY     .M2X    B1,A8,B3
        MPYHL   .M1X    B1,A8,A3

 [ A1]  B       .S1     L3

 [ B0]  LDW     .D2T2   *B8,B1

 [ B0]  SUB     .S2     B0,1,B0
        ADD     .L1     A3,A7:A6,A7:A6
        ADD     .L2     B3,B7:B6,B7:B6
        MPYH    .M1X    B2,A8,A3
        MPYHL   .M2X    A8,B9,B3

 [ A1]  SUB     .S1     A1,1,A1

 [ B0]  LDW     .D1T1   *A0++,A8

 [ B0]  LDW     .D2T2   *++B8,B9


<compiler output from Example 4-10> 


L4:     ; PIPED LOOP KERNEL

 [ A2]  SUB     .S1     A2,1,A2
        ADD     .L1     A5,A1:A0,A1:A0
        MPY     .M1X    B5,A4,A5

 [ B0]  B       .S2     L4

 [ B0]  SUB     .L2     B0,1,B0

 [ A2]  LDH     .D1T1   *A3++,A4

 [ A2]  LDH     .D2T2   *B4++,B5


Note:

The _nassert() intrinsic may not convert all of your short accesses to int
accesses, or float accesses to double accesses, but it can be a useful tool
for achieving better performance without rewriting the C code. Using the
_nassert() intrinsic to try to force double-word accesses does not improve
floating-point code.


If your code operates on global arrays as in Example 4-18, and you build your
application with the -pm and -o3 options, the compiler will have enough
information (trip counts and alignments of variables) to determine whether or
not SIMD optimization is feasible.


Example 4-18. Automatic Use of Word Accesses Without the _nassert Intrinsic


<file1.c>
int dotp(short *a, short *b, int c)
{
    int sum = 0, i;
    for (i = 0; i < c; i++) sum += a[i] * b[i];
    return sum;
}

<file2.c>
#include <stdio.h>

short x[40] = {  1,  2,  3,  4,  5,  6,  7,  8,  9, 10,
                11, 12, 13, 14, 15, 16, 17, 18, 19, 20,
                21, 22, 23, 24, 25, 26, 27, 28, 29, 30,
                31, 32, 33, 34, 35, 36, 37, 38, 39, 40 };

short y[40] = { 40, 39, 38, 37, 36, 35, 34, 33, 32, 31,
                30, 29, 28, 27, 26, 25, 24, 23, 22, 21,
                20, 19, 18, 17, 16, 15, 14, 13, 12, 11,
                10,  9,  8,  7,  6,  5,  4,  3,  2,  1 };

void main()
{
    int z;
    z = dotp(x, y, 40);
    printf("z = %d\n", z);
}

Compile file1.c and file2.c with:

cl6x -pm -o3 -k -mw file1.c file2.c


Edit the resulting assembly file (file1.asm). Notice that the dot product loop
uses word accesses and the 'C6x intrinsics.


L12:    ; PIPED LOOP KERNEL

 [!A1]  ADD     .L2     B6,B7,B7

 [!A1]  ADD     .L1     A6,A0,A0
        MPY     .M2X    B5,A4,B6
        MPYH    .M1X    B5,A4,A6

 [ B0]  B       .S1     L12
        LDW     .D1T1   *+A5(4),A4
        LDW     .D2T2   *+B4(4),B6

 [ A1]  SUB     .S1     A1,1,A1

 [!A1]  ADD     .S2     B5,B8,B8

 [!A1]  ADD     .L1     A6,A3,A3
        MPY     .M2X    B6,A4,B5
        MPYH    .M1X    B6,A4,A6

 [ B0]  SUB     .L2     B0,1,B0
        LDW     .D1T1   *++A5(8),A4
        LDW     .D2T2   *++B4(8),B5
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4.3.3 Software Pipelining 


Software pipelining is a technique used to schedule instructions from a loop
so that multiple iterations of the loop execute in parallel. When you use the
-o2 and -o3 compiler options, the compiler attempts to software pipeline your
code with information that it gathers from your program.


Figure 4-2 illustrates a software-pipelined loop. The stages of the loop are
represented by A, B, C, D, and E. In this figure, a maximum of five iterations
of the loop can execute at one time. The shaded area represents the loop
kernel. In the loop kernel, all five stages execute in parallel. The area
immediately before the kernel is known as the pipelined-loop prolog, and the
area immediately following the kernel is known as the pipelined-loop epilog.


Figure 4-3. Software-Pipelined Loop

A1
B1  A2
C1  B2  A3                  Pipelined-loop prolog
D1  C2  B3  A4

E1  D2  C3  B4  A5          Kernel

    E2  D3  C4  B5
        E3  D4  C5          Pipelined-loop epilog
            E4  D5
                E5
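

The shape of the figure follows from simple arithmetic on cycle numbers. The
sketch below is an illustrative model only, assuming that every stage takes
one cycle and that a new iteration starts on every cycle; the function name
iterations_in_flight() is hypothetical.

```c
#include <assert.h>

/* Number of loop iterations in flight during a given cycle (0-based),
 * assuming one cycle per stage and one new iteration started per cycle. */
int iterations_in_flight(int cycle, int n_iter, int n_stages)
{
    int first, last;

    assert(n_iter > 0 && n_stages > 0);

    /* Iteration k (0-based) occupies cycles k through k + n_stages - 1. */
    first = cycle - (n_stages - 1);   /* oldest iteration still running */
    if (first < 0)
        first = 0;
    last = cycle;                     /* newest iteration already started */
    if (last > n_iter - 1)
        last = n_iter - 1;

    return (last >= first) ? (last - first + 1) : 0;
}
```

With five iterations of five stages each, the count rises from 1 during the
prolog to 5 in the kernel row and falls back to 1 in the last epilog row.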


Because loops present critical performance areas in your code, consider the 
following areas to improve the performance of your C code: 


- Trip count
- Redundant loops
- Loop unrolling
- Speculative execution
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4.3.3.1 Trip Count Issues 


A trip count is the number of times that a loop executes; the trip counter is the 
variable used to count each iteration. When the trip counter reaches a limit 
equal to the trip count, the loop terminates. The structure of a software pipeline 
requires the execution of a minimum number of loop iterations (a minimum trip 
count) in order to fill, or prime, the pipeline. 


Loops that are eligible for software pipelining have loop trip counters that count 
down. In most cases, the compiler can transform the loop to use a trip counter 
that counts down even if the original code was not written that way. 


For example, the optimizer at levels -o2 and -o3 transforms the loop in
Example 4-19(a) to something like the code in Example 4-19(b).


Example 4-19. Trip Counters

(a) Original code

for (i = 0; i < N; i++)     /* i = trip counter, N = trip count */

(b) Optimized code

for (i = N; i != 0; i--)    /* Downcounting trip counter */


The minimum trip count for a software pipelined loop is determined by the mini- 
mum number of times the loop will execute. 


If the compiler knows the lower bound on the trip count (and in some cases, 
the upper bound), it can generate faster and more compact code. If the 
compiler cannot determine that a loop always executes for the minimum trip 
count, it generates a redundant unpipelined loop. The redundant unpipelined 
loop is executed only when the runtime trip count is less than the minimum trip 
count; otherwise, the software-pipelined version of the loop is executed. 
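

The effect of the redundant loop can be modeled in C. This is only a sketch of
the control flow, not what the compiler emits: MIN_TRIP_COUNT and both function
names are hypothetical, and the "pipelined" path is represented by an ordinary
downcounting loop.

```c
#include <assert.h>

#define MIN_TRIP_COUNT 2u   /* hypothetical minimum trip count */

/* Stand-in for the software-pipelined version: the trip counter
 * counts down to zero, as in Example 4-19(b). */
int sum_downcount(const short *a, unsigned int n)
{
    int sum = 0;
    unsigned int i;
    for (i = n; i != 0; i--)
        sum += *a++;
    return sum;
}

/* Runtime dispatch between the two versions of the loop. */
int sum_dispatch(const short *a, unsigned int n)
{
    int sum = 0;
    unsigned int i;

    if (n < MIN_TRIP_COUNT) {
        /* Redundant unpipelined loop: used only for short trip counts. */
        for (i = 0; i < n; i++)
            sum += a[i];
        return sum;
    }
    return sum_downcount(a, n);
}
```

If the compiler can prove that n is never below the minimum, the first branch
and its loop body are dead code, which is why supplying a lower bound saves
code size.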
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4.3.3.2 Eliminating Redundant Loops 


In Example 4-2 on page 4-9, the compiler cannot determine whether the loop
always executes at least the minimum trip count. Therefore, it generates two
versions of the loop:


- An unpipelined version that executes if N is less than the minimum trip
  count (in this case, the minimum trip count equals 2)

- A software-pipelined version that executes if N is equal to or greater than
  the minimum trip count


To indicate to the compiler that you do not want two versions of the loop, you
can use the -ms0 option so that the compiler generates only the
software-pipelined code and never generates a redundant loop; however, loops
with an unknown trip count, or where the trip count is less than the minimum
trip count, are not software pipelined.


4.3.3.3. Communicating Trip-Count Information to the Compiler 
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When invoking the compiler, use the following options to communicate trip- 
count information to the compiler: 


- Use the -o3 and -pm compiler options to allow the optimizer to access the
  whole program or large parts of it and to characterize the behavior of loop
  trip counts.

- Use the _nassert intrinsic to help reduce code size by preventing the
  generation of a redundant loop or by allowing the compiler (with or without
  the -ms option) to software pipeline innermost loops.


You can use the _nassert intrinsic to convey many different types of
information about the trip count to the compiler.

- It can convey that the trip count will always equal some value.

  /* This loop will always execute exactly 30 times */
  _nassert(x == 30);
  for (j = 0; j < x; j++)

- It can convey that the trip count will be greater than some minimum value
  or smaller than some maximum value. The latter is useful when interrupts
  need to occur inside of loops and you are using the -mi<n> option. Refer
  to section 8.4, Interruptible Loops.

  /* This loop will always execute at least 30 times */
  _nassert(x >= 30);
  for (j = 0; j < x; j++)

- It can convey that the trip count is always divisible by a value.

  /* The loop will execute some multiple of 4 times */
  _nassert((x % 4) == 0);
  for (j = 0; j < x; j++)

- It can convey information about the alignment of pointers and arrays.

  void vecsum(short *a, const short *b, const short *c)
  {
      _nassert(((int) a & 0x3) == 0);
      _nassert(((int) b & 0x3) == 0);
      _nassert(((int) c & 0x3) == 0);
  }

This information can also be combined into a single C statement:

  _nassert((x >= 8) && (x <= 48) && ((x % 8) == 0));
  for (j = 0; j < x; j++)


The compiler then knows that this loop will execute some multiple of 8
(between 8 and 48) times. This information helps the compiler decide whether
it can unroll the loop or perform word accesses in it.
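

Because _nassert only informs the compiler and is not checked at run time, it
can be useful to verify the same conditions on a host build. The sketch below
shows one possible arrangement; the LOOP_HINT macro and sum_block() are
hypothetical names, and the use of the _TMS320C6X predefined macro to detect
the 'C6x compiler is an assumption that should be checked against your
compiler documentation.

```c
#include <assert.h>

/* On the target, pass the condition to the compiler as a hint.
 * On a host build, check it at run time instead. */
#ifdef _TMS320C6X                 /* assumed to be predefined by cl6x */
#define LOOP_HINT(cond) _nassert(cond)
#else
#define LOOP_HINT(cond) assert(cond)
#endif

/* Sums x elements; x is promised to be a multiple of 8 between 8 and 48. */
int sum_block(const short *data, int x)
{
    int j, sum = 0;

    LOOP_HINT((x >= 8) && (x <= 48) && ((x % 8) == 0));

    for (j = 0; j < x; j++)
        sum += data[j];
    return sum;
}
```

A host test run that violates the promise (for example, x equal to 12) stops
at the assertion, whereas on the target the same call would silently run code
that was generated under a false assumption.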


Several examples in this chapter and in section 8.4.4 show all of the different 
ways that the _nassert intrinsic can be used. 


See the TMS320C6x Optimizing C Compiler User's Guide for a complete
discussion of the -ms, -o3, and -pm options and the _nassert intrinsic.
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4.3.3.4 Loop Unrolling 


Another technique that improves performance is unrolling the loop; that is, ex- 
panding small loops so that each iteration of the loop appears in your code. 
This optimization increases the number of instructions available to execute in 
parallel. You can use loop unrolling when the operations in a single iteration 
do not use all of the resources of the ’C6x architecture. 


In Example 4-20, the loop produces a new sum[i] every two cycles. Three
memory operations are performed: a load for both in1[i] and in2[i] and a store
for sum[i]. Because only two memory operations can execute per cycle, two
cycles are necessary to perform three memory operations.


Example 4-20. Vector Sum With Three Memory Operations 


void vecsum2(short *sum, const short *in1,
             const short *in2, unsigned int N)
{
    int i;

    for (i = 0; i < N; i++)
        sum[i] = in1[i] + in2[i];
}


The performance of a software pipeline is limited by the number of resources 
that can execute in parallel. In its word-aligned form (Example 4—21), the vec- 
tor sum loop delivers two results every two cycles because the two loads and 
the store are all operating on two 16-bit values at a time. 


Example 4-21. Word-Aligned Vector Sum 


void vecsum4(short *sum, const short *in1,
             const short *in2, unsigned int N)
{
    int i;
    const int *i_in1 = (const int *)in1;
    const int *i_in2 = (const int *)in2;
    int *i_sum = (int *)sum;

    _nassert(N >= 20);

    for (i = 0; i < (N/2); i++)
        i_sum[i] = _add2(i_in1[i], i_in2[i]);
}
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If you unroll the loop once, the loop then performs six memory operations per 
iteration, which means the unrolled vector sum loop can deliver four results 
every three cycles (that is, 1.33 results per cycle). Example 4—22 shows four 
results for each iteration of the loop: sum[i] and sum[i+sz] each store an int 
value that represents two 16-bit values. 
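

The cycle counts quoted for the three vector sum examples all follow from the
limit of two memory operations per cycle. The helper functions below are a
hypothetical illustration of that arithmetic only; they ignore every other
resource constraint.

```c
#include <assert.h>

#define MEM_OPS_PER_CYCLE 2   /* from the text: two memory operations per cycle */

/* Fewest cycles in which one loop iteration's memory operations can issue. */
int min_cycles_per_iteration(int mem_ops)
{
    assert(mem_ops > 0);
    return (mem_ops + MEM_OPS_PER_CYCLE - 1) / MEM_OPS_PER_CYCLE;
}

/* Best-case throughput when memory operations are the only limit. */
double results_per_cycle(int results, int mem_ops)
{
    return (double)results / min_cycles_per_iteration(mem_ops);
}
```

Three memory operations per iteration need two cycles, so the halfword loop
yields 0.5 results per cycle and the word-aligned loop yields 1.0. Six memory
operations need three cycles, so the unrolled loop yields 4/3, or about 1.33,
results per cycle.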


Example 4-22 is not simple loop unrolling, in which the loop body is merely
replicated. The additional instructions use memory pointers that are offset to
point midway into the input arrays, and the code assumes that the arrays are a
multiple of four shorts in size.


Example 4—22. Vector Sum Using const Keywords, _nassert, Word Reads, and 
Loop Unrolling 


void vecsum6(int *sum, const int *in1, const int *in2, unsigned int N)
{
    int i;
    int sz = N >> 2;

    _nassert(N >= 20);

    for (i = 0; i < sz; i++)
    {
        sum[i]    = _add2(in1[i], in2[i]);
        sum[i+sz] = _add2(in1[i+sz], in2[i+sz]);
    }
}

Software pipelining is performed by the compiler only on inner loops; there- 
fore, you can increase performance by creating larger inner loops. One 
method for creating large inner loops is to completely unroll inner loops that 
execute for a small number of cycles. 


In Example 4—23, the compiler pipelines the inner loop with a kernel size of one 
cycle; therefore, the inner loop completes a result every cycle. However, the 
overhead of filling and draining the software pipeline can be significant, and 
other outer-loop code is not software pipelined. 
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Example 4-23. FIR_Type2— Original Form 


void fir2(const short input[], const short coefs[], short out[])
{
    int i, j;
    int sum = 0;

    for (i = 0; i < 40; i++)
    {
        for (j = 0; j < 16; j++)
            sum += coefs[j] * input[i + 15 - j];

        out[i] = (sum >> 15);
    }
}
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For loops with a simple loop structure, the compiler uses a heuristic to deter- 
mine if it should unroll the loop. Because unrolling can increase code size, in 
some cases the compiler does not unroll the loop. If you have identified this 
loop as being critical to your application, then unroll the inner loop in C code, 
as in Example 4—24. 


In general, unrolling may be a good idea if you have an uneven partition or if
your loop-carried dependency bound is greater than the partition bound. (Refer
to section 7.7, Loop Carry Paths, and section 3.2 in the TMS320C6x Optimizing
C Compiler User's Guide.) This information can be obtained by using the -mw
option and looking at the comment block before the loop.


Example 4-24. FIR_Type2—Inner Loop Completely Unrolled 
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void fir2_u(const short input[], const short coefs[], short out[]) 


{
    int i, j;
    int sum;

    for (i = 0; i < 40; i++)
    {
        sum  = coefs[0]  * input[i + 15];
        sum += coefs[1]  * input[i + 14];
        sum += coefs[2]  * input[i + 13];
        sum += coefs[3]  * input[i + 12];
        sum += coefs[4]  * input[i + 11];
        sum += coefs[5]  * input[i + 10];
        sum += coefs[6]  * input[i + 9];
        sum += coefs[7]  * input[i + 8];
        sum += coefs[8]  * input[i + 7];
        sum += coefs[9]  * input[i + 6];
        sum += coefs[10] * input[i + 5];
        sum += coefs[11] * input[i + 4];
        sum += coefs[12] * input[i + 3];
        sum += coefs[13] * input[i + 2];
        sum += coefs[14] * input[i + 1];
        sum += coefs[15] * input[i + 0];
        out[i] = (sum >> 15);
    }
}


Now the outer loop is software-pipelined, and the overhead of draining and 
filling the software pipeline occurs only once per invocation of the function 
rather than for each iteration of the outer loop. 
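

The same transformation can be tried on a smaller filter to confirm that the
rolled and unrolled forms compute identical outputs. The sketch below assumes
a hypothetical 4-tap filter with three outputs, restarts the sum for each
output, and omits the final shift so that the values stay easy to check by
hand.

```c
#include <assert.h>

#define NTAPS 4   /* hypothetical; Example 4-24 uses 16 taps     */
#define NOUT  3   /* hypothetical; Example 4-24 uses 40 outputs  */

/* Nested form: the short inner loop is the only loop that could be pipelined. */
void fir_rolled(const short in[], const short c[], int out[])
{
    int i, j, sum;
    for (i = 0; i < NOUT; i++) {
        sum = 0;
        for (j = 0; j < NTAPS; j++)
            sum += c[j] * in[i + NTAPS - 1 - j];
        out[i] = sum;
    }
}

/* Inner loop completely unrolled: the former outer loop is now innermost. */
void fir_unrolled(const short in[], const short c[], int out[])
{
    int i, sum;
    for (i = 0; i < NOUT; i++) {
        sum  = c[0] * in[i + 3];
        sum += c[1] * in[i + 2];
        sum += c[2] * in[i + 1];
        sum += c[3] * in[i + 0];
        out[i] = sum;
    }
}
```

The input array must hold NOUT + NTAPS - 1 samples. After unrolling, there is
a single loop left, so it becomes the candidate for software pipelining.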


The heuristic the compiler uses to determine if it should unroll the loops needs 
to know either of the following pieces of information. Without knowing either 
of these the compiler will never unroll a loop. 


- The exact trip count of the loop
- Or that the trip count of the loop is some multiple of two


The second requirement can be passed to the compiler through the _nassert
intrinsic. Section 4.3.3.3, Communicating Trip-Count Information to the
Compiler, explains that _nassert can be used to provide information about
loop unrolling. By using the modulus operator, you can specify that the
trip count is a multiple or power of two.


_nassert((n % 2) == 0); 
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Example 4-25 shows how the compiler can perform simple loop unrolling of 
replicating the loop body. The _nassert intrinsic tells the compiler that the
loop will execute an even number of times and at least 20 times. The compiler
will unroll the loop once to take advantage of the performance gain that
results from the unrolling.


Example 4-25. Vector Sum 


void func(short *a, const short *b, const short *c, int n)
{
    int i;

    _nassert(((n % 2) == 0) && (n >= 20));

    for (i = 0; i < n; i++) a[i] = b[i] + c[i];
}


<compiler output for above code> 

L2:     ; PIPED LOOP KERNEL

        ADD     .L1X    B4,A3,A3

 [ B0]  B       .S1     L2

        LDH     .D1T1   *++A4(4),A3
        LDH     .D2T2   *++B5(4),B4


        STH     .D1T1   A3,*++A0(4)
        ADD     .L2X    B6,A5,B6
        LDH     .D2T2   *+B5(2),B6


        SUB     .L1     A1,1,A1
        STH     .D2T2   B6,*++B7(4)
        SUB     .L2     B0,1,B0
        LDH     .D1T1   *+A4(2),A5


Note: When the interrupt threshold option is used, unrolling can be used to re- 
gain lost performance. Refer to section 8.4.4 Getting the Most Performance 
Out of Interruptible Code. 
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4.3.3.5 Speculative Execution (—-mh option) 


The -mh option eliminates the epilog for a software-pipelined loop, which can
result in significant code size savings. Software-pipelined loop epilogs can
often be eliminated if load instructions can be speculatively executed. An
instruction is speculatively executed if it is executed before it is known
whether the result of the instruction is needed. Allowing speculative
execution of load instructions may result in a read past the beginning or end
of a buffer. For a complete discussion of the -mh option, see the TMS320C6x
Optimizing C Compiler User's Guide.


4.3.3.6 Software Pipelining Retry (-mx option) 


Use the -mx option whenever you are concerned about getting the best possible
pipelined schedule out of a loop. Using the -mx option tells the compiler
to take more time to search for other possible schedules of the loop. The
compiler will select the best version of the pipelined loop and generate
assembly instructions for that version, based on the trip count information
about the loop. Since the compiler needs as much information about the loop as
possible, it is recommended that you use the -o3 and -pm options with -mx, or
use the _nassert intrinsic to describe the trip count characteristics of your
important loops.


4.3.3.7 What Disqualifies a Loop from Being Software-Pipelined 


In a sequence of nested loops, the innermost loop is the only one that can be 
software-pipelined. The following restrictions apply to the software pipelining 
of loops: 


- Although a software-pipelined loop can contain intrinsics, it cannot contain
  function calls, including code that will call the run-time support routines.

  for (i = 0; i < 100; i++)
      x[i] = x[i] % 5;

  This will call the run-time support _remi routine.


- You must not have a conditional break (early exit) in the loop. You need
  to rewrite your code to use if statements instead. Use the if statements
  only around code that updates memory (stores to pointers and arrays) and
  around variables whose values are calculated inside the loop and are used
  outside the loop. Also, do not nest if statements. The compiler cannot
  software pipeline a loop that contains nested if statements. Example 4-26
  shows how to combine the nested conditions using && into one if condition.



Example 4-26. Use of If Statements in Float Collision Detection 


(a) Original Code

int colldet(const float *x, const float *p, float point,
            float distance)
{
    int I, retval = 0;
    float sum0, sum1, dist0, dist1;

    for (I = 0; I < (28 * 3); I += 6)
    {
        sum0 = x[I+0]*p[0] + x[I+1]*p[1] + x[I+2]*p[2];
        sum1 = x[I+3]*p[0] + x[I+4]*p[1] + x[I+5]*p[2];
        dist0 = sum0 - point;
        dist1 = sum1 - point;
        dist0 = fabs(dist0);
        dist1 = fabs(dist1);
        if (dist0 < distance)
        {
            retval = (int)&x[I + 0];
            break;
        }
        if (dist1 < distance)
        {
            retval = (int)&x[I + 3];
            break;
        }
    }
    return retval;
}
(b) Modified Code

int colldet_new(const float *x, const float *p, float point,
                float distance)
{
    int I, retval = 0;
    float sum0, sum1, dist0, dist1;

    for (I = 0; I < (28 * 3); I += 6)
    {
        sum0 = x[I+0]*p[0] + x[I+1]*p[1] + x[I+2]*p[2];
        sum1 = x[I+3]*p[0] + x[I+4]*p[1] + x[I+5]*p[2];
        dist0 = sum0 - point;
        dist1 = sum1 - point;
        dist0 = fabs(dist0);
        dist1 = fabs(dist1);
        if ((dist0 < distance) && !retval) retval = (int)&x[I+0];
        if ((dist1 < distance) && !retval) retval = (int)&x[I+3];
    }
    return retval;
}

- The loop cannot have an incrementing loop counter. Run the optimizer
  with the -o2 or -o3 option to convert as many loops as possible into
  downcounting loops.

- If the trip counter is modified within the body of the loop, it typically
  cannot be converted into a downcounting loop. If possible, rewrite the loop
  to not modify the trip counter. For example, the following code will not
  software pipeline:

  for (i = 0; i < n; i++)

- A conditionally incremented loop control variable is not software pipelined.
  Again, if possible, rewrite the loop to not conditionally modify the trip
  counter. For example, the following code will not software pipeline:

  for (i = 0; i < x; i++)
  {
      if (b > a)
          i += 2;
  }

- If the code size is too large and requires more than the 32 registers in the
  'C6x, it is not software pipelined. Either try to simplify the loop or break
  the loop up into multiple smaller loops.

- If a register value is live too long, the code is not software pipelined.
  See section 7.6.6.2, Live Too Long, on page 7-68 and section 7.10,
  Live-Too-Long Issues, on page 7-102 for examples of code that is live too
  long.

- If the loop has complex condition code within the body that requires more
  than the five 'C6x condition registers, the loop is not software pipelined.
  Try to eliminate or combine these conditions.
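

The early-exit restriction above can be illustrated with a smaller search
loop. The two functions below are a hypothetical sketch of the same rewrite
used in Example 4-26: the break is replaced by a guard that stops further
updates once a result has been recorded.

```c
#include <assert.h>

/* Early exit: returns the index of the first element below limit, or -1. */
int first_below_break(const int *v, int n, int limit)
{
    int i, idx = -1;
    for (i = 0; i < n; i++) {
        if (v[i] < limit) {
            idx = i;
            break;            /* conditional break: blocks software pipelining */
        }
    }
    return idx;
}

/* No early exit: the loop always runs n times, and a single combined
 * condition guards the only variable that is used after the loop. */
int first_below_nobreak(const int *v, int n, int limit)
{
    int i, idx = -1;
    for (i = 0; i < n; i++) {
        if ((v[i] < limit) && (idx < 0))
            idx = i;
    }
    return idx;
}
```

The rewritten loop reads every element even after a match is found, so the
whole array must be safe to read, and the trip count no longer depends on the
data.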



Chapter 5 


Linking Issues 


This chapter contains useful information about other problems and questions 
that might arise while building your projects, including: 


- What to do with the relocation value truncated linker and assembler
  messages

- How to save on-chip memory by moving the RTS off-chip

- How to build your application with RTS calls either near or far

- How to change the default RTS data from far to near
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5.1 How to Use Linker Error Messages 
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When you try to call a function which, due to how you linked your application, 
is too far away from a call site to be reached with the normal PC-relative branch
instruction, you will see the following linker error message: 

>> PC-relative displacement overflow. Located in file.obj,
   section .text, SPC offset 000000bc


This message means that in the named object file, in that particular section,
there is a PC-relative branch instruction trying to reach a call destination
that is too far away. The SPC offset is the section program counter (SPC)
offset within that section where the branch occurs. For C code, the section
name will be .text (unless a CODE_SECTION pragma is in effect).


You might also see this message in connection with an MVK instruction: 


>> relocation value truncated at 0xa4 in section .text,
   file file.obj


Or, an MVK can be the source of this message: 


>> Signed 16-bit relocation out of range, value truncated. 
Located in file.obj, section .text, SPC offset 000000a4 


These messages are similar. The file is file.obj, the section is .text, and the
SPC offset is 0xa4. If this happens to you when you are linking C code, here
is what you do to find the problem:


- Recompile the C source file as you did before, but include -s -al in the
  options list:

  cl6x <other options> -s -al file.c

  This will give you C interlisted in the assembly output and create an
  assembler listing file with the extension .lst.

- Edit the resulting .lst file, in this case file.lst.

- Each line in the assembly listing has several fields. For a full description
  of those fields, see section 3.10 of the TMS320C6x Assembly Language
  Tools User's Guide. The field you are interested in here is the second one,
  the section program counter (SPC) field. Find the line with the same SPC
  field as the SPC offset given in the linker error message. It will look like:


245 000000bc OFFFEC10! B .S1 _atoi ; |56| 


In this case, the call to the function atoi is too far away from the location where 
this code is linked. 


It is possible that use of —s will cause instructions to move around some and 
thus the instruction at the given SPC offset is not what you expect. The branch 
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or MVK nearest to that instruction is the most likely cause. Or, you can rebuild 
the whole application with —s —al and relink to see the new SPC offset of the 
error. 


If you are tracing a problem in a hand-coded assembly file, the process is
similar, but you merely re-assemble with the -l option instead of recompiling.


To fix a branch problem, your choices are: 


- Use the -mr1 option to force the call to atoi, and all other RTS functions,
  to be far.

- Compile with -ml1 or higher to force all calls to be far.

- Rewrite your linker command file (looking at a map file usually helps) so
  that all the calls to atoi are close (within 0x100000 words) to where atoi is
  linked.
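

When rearranging a linker command file, it can help to test candidate
addresses against the branch range before relinking. The function below is a
simplified sketch: near_call_reaches() is a hypothetical name, the range is
taken from the 0x100000-word figure above, and the displacement is measured
from the call site itself rather than from the exact address the hardware
uses.

```c
#include <assert.h>
#include <stdint.h>

#define NEAR_RANGE_WORDS 0x100000L   /* from the text: 0x100000 words */

/* Returns 1 if a PC-relative branch at call_site could plausibly reach
 * target, 0 otherwise. Both arguments are byte addresses. */
int near_call_reaches(uint32_t call_site, uint32_t target)
{
    /* Displacement in 32-bit instruction words. */
    int64_t disp_words = ((int64_t)target - (int64_t)call_site) / 4;

    return disp_words >= -NEAR_RANGE_WORDS && disp_words < NEAR_RANGE_WORDS;
}
```

For example, code at address 0 can reach a function at 0x003FFFFC under this
model but not one at 0x00400000, so a call from the start of on-chip program
memory to a function placed at that address would need to be far.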


If the problem instruction is an MVK, then you need to understand why the 
constant expression does not fit. 


For C code, you might find the instruction looks like: 
50 000000a4 0200002A% MVK (_ary-$bss),B4 ; |5| 


In this case, the address of the C object ary is being computed as if ary is
declared near (the default), but because it falls outside of the 15-bit address
range the compiler presumes for near objects, you get the warning. To fix this
problem, you can declare ary to be far, or you can use the correct cl6x -ml<n>
memory model option to automatically declare ary and other such data objects
to be far. See chapter 2 of the TMS320C6x Optimizing C Compiler User's
Guide for more information on -ml<n>.


It is also possible that ary is defined as far in one file and declared as near
in this file. In that case, ensure ary is defined and declared consistently in
all files in the project.

If the MVK instruction is just a simple load of an address: 

123 000000a4 0200002A! MVK sym, B4 


Then the linker warning message is telling you that sym is greater than 32767, 
and you will end up with something other than the value of sym in B4. In most 
cases, this instruction is accompanied by: 


124 000000a8 0200006A! MVKH sym, B4 
When this is the case, the solution is to change the MVK to MVKL. 
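

The reason a lone 16-bit move cannot hold a full address, and why the
low-half/high-half pair can, is easy to model. The sketch below is
illustrative only: fits_mvk(), mvkl(), and mvkh() are hypothetical C functions
that imitate the documented effect of the instructions on a 32-bit register.

```c
#include <assert.h>
#include <stdint.h>

/* A single MVK can only represent a signed 16-bit constant. */
int fits_mvk(int32_t value)
{
    return value >= -32768 && value <= 32767;
}

/* Model of MVKL: load the low 16 bits of value, sign-extended to 32 bits. */
uint32_t mvkl(uint32_t value)
{
    uint32_t low = value & 0xFFFFu;
    return (low & 0x8000u) ? (low | 0xFFFF0000u) : low;
}

/* Model of MVKH: replace the upper 16 bits of reg with those of value. */
uint32_t mvkh(uint32_t reg, uint32_t value)
{
    return (value & 0xFFFF0000u) | (reg & 0x0000FFFFu);
}
```

For a symbol at 0x8000A4F0, the low-half move alone leaves 0xFFFFA4F0 in the
register, which is "something other than the value of sym"; following it with
the high-half move restores the full 0x8000A4F0.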


On any other MVK problem, it usually helps to look up the value of the
symbol(s) involved in the linker map file.


5.1.1 Executable Flag


You may also see the linker message: 
>> warning: output file file.out not executable 


If this is due solely to MVK instructions, paired with MVKH, which have yet to 
be changed to MVKL, then this warning may safely be ignored. The loaders 
supplied by TI will still load and execute this .out file. 


If you implement your own loader, please be aware this warning message 
means the F_EXEC flag in the file header is not set. If your loader depends on 
this flag, then you will have to fix your MVK instructions, or use the switches 
described above to turn off these warnings. 
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5.2 How to Save On-Chip Memory by Placing RTS Off-Chip 


One of many techniques you might use to save valuable on-chip space is to 
place the code and data needed by the runtime-support (RTS) functions in off- 
chip memory. 


Placing the RTS in off-chip memory has the advantage of saving valuable
on-chip space. However, it comes at a cost: the RTS functions will run much
slower. Depending on your application, this may or may not be acceptable. It
is also possible that your application doesn't use the RTS library much, and
placing the RTS off-chip saves very little on-chip memory.


Table 5-1. Definitions

Term                    Means
----------------------  ------------------------------------------------------
Normal RTS functions    Ordinary RTS functions. Example: strcpy

Internal RTS functions  Functions which implement atomic C operations such as
                        divide or floating point math on the C62xx. Example:
                        _divu performs 32-bit unsigned divide.

near calls              Function calls performed with an ordinary PC-relative
                        branch instruction. The destination of such branches
                        must be within 1 048 576 (0x100000) words of the
                        branch. Such calls use 1 instruction word and 1 cycle.

far calls               Function calls performed by loading the address of the
                        function into a register and then branching to the
                        address in the register. There is no limit on the
                        range of the call. Such calls use 3 instruction words
                        and 3 cycles.


5.2.1 How to Compile 


Use the shell (cl6x) options to control how RTS functions are called:


Table 5-2. Command Line Options for RTS Calls

Option    Internal RTS calls   Normal RTS calls
-------   ------------------   ----------------
Default   Same as user         Same as user
-mr0      Near                 Near
-mr1      Far                  Far


By default, RTS functions are called with the same convention as ordinary
user-coded functions. If you do not use a -ml<n> option to enable one of the
large-memory models, then these calls will be near. The option -mr0 causes
calls to RTS functions to be near, regardless of the setting of the -ml<n>
switch. This option is for special situations, and typically isn't needed. The
option -mr1 will cause calls to RTS functions to be far, regardless of the
setting of the -ml<n> switch.


Note these options only address how RTS functions are called. Calling func- 
tions with the far method does not mean those functions must be in off-chip 
memory. It simply means those functions can be placed at any distance from 
where they are called. 


5.2.2 Must #include Header Files


When you call an RTS function, you must include the header file which
corresponds to that function. For instance, when you call memcmp, you must
#include <string.h>. If you do not include the header, the memcmp call looks
like a normal user call to the compiler, and the effect of using -mr1 does not
occur.


5.2.3 RTS Data


Most RTS functions do not have any data of their own. Data is typically passed
as arguments or through pointers. However, a few functions do have their own
data. All of the "is<xxx>" character recognition functions defined in ctype.h
refer to a global table. Also, many of the floating point math functions have
their own constant look-up tables. All RTS data is defined to be far data,
that is, accessed without regard to where it is in memory. Again, this does
not necessarily mean this data is in off-chip memory.


Details on how to change access of RTS data are given in section 5.2.7.
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5.2.4 How to Link 


You place the RTS code and data in off-chip memory through the linking
process. Here is an example linker command file you could use instead of the
lnk.cmd file provided in the \lib directory.


/*********************************************************************/
/* farlnk.cmd - Link command file which puts RTS off-chip            */
/*********************************************************************/
-c
-heap  0x2000
-stack 0x4000


/* Memory Map 1 - the default */ 
MEMORY 
{ 


= 00000000h = 00010000h 
= 00400000h = 01000000h 
01400000h 00400000h 
02000000h 01000000h 
= 03000000h = 01000000h 
= 80000000h = 00010000h 


SECTIONS


/* a - 
/* Sections defined only in RTS. 


/* 
-stack 
.sysmem 
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.-switch 
.far 


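/*---------------------------------------------------------------*/
/* All of .cinit, including from RTS, must be collected together */
/* in one step.                                                  */
/*---------------------------------------------------------------*/


/*---------------------------------------------------------------*/
/* RTS code - placed off chip                                    */
/*---------------------------------------------------------------*/
.rtstext { -lrts6201.lib(.text) } > EXT0


/*---------------------------------------------------------------*/
/* RTS data - undefined sections - placed off chip               */
/*---------------------------------------------------------------*/
.rtsbss  { -lrts6201.lib(.bss)
           -lrts6201.lib(.far) } > EXT0


/*---------------------------------------------------------------*/
/* RTS data - defined sections - placed off chip                 */
/*---------------------------------------------------------------*/
.rtsdata { -lrts6201.lib(.const)
           -lrts6201.lib(.switch) } > EXT0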


User sections (.text, .bss, .const, .data, .switch, .far) are built and allocated
normally.


The .cinit section is built normally as well. It is important not to allocate
the RTS .cinit sections separately as is done with the other RTS sections. All
of the .cinit sections must be combined together into one section for
auto-initialization of global variables to work properly.


The .stack, .sysmem, and .cio sections are entirely created from within the 
RTS. So, you don’t need any special syntax to build and allocate these sec- 
tions separately from user sections. Typically, you place the .stack (system 
stack) and .sysmem (heap of memory used by malloc, etc.) sections in on-chip 
memory for performance reasons. The .cio section is a buffer used by printf 
and related functions. You can typically afford slower performance of such I/O 
functions, so it is placed in off-chip memory. 


The .rtstext section collects all the .text, or code, sections from RTS and
allocates them to the external memory named EXT0. If needed, replace the
library name rts6201.lib with the library you normally use, perhaps
rts6701.lib. The -l is required, and no space is allowed between the -l and
the name of the library. The choice of EXT0 is arbitrary. Use the memory range
which makes the most sense in your application.


The .rtsbss section combines all of the undefined data sections together.
Undefined sections reserve memory without any initialization of the contents
of that memory. You use .bss and .usect assembler directives to create
undefined data sections.


The .rtsdata section combines all of the defined data sections together.
Defined data sections both reserve and initialize the contents of a section.
You use the .sect assembler directive to create defined sections.


It is necessary to build and allocate the undefined data sections separately 
from the defined data sections. When a defined data section is combined to- 
gether with an undefined data section, the resulting output section is a defined 
data section, and the linker must fill the range of memory corresponding to the 
undefined section with a value, typically the default value of 0. This has the un- 
desirable effect of making your resulting .out file much larger. 


You may get a linker warning like:


>> farlnk.cmd, line 65: warning: rts6201.lib(.switch) not found


That means none of the RTS functions needed by your application define a
.switch section. Simply delete the corresponding -l entry in the linker
command file to avoid the message. If your application changes such that you
later do include an RTS function with a .switch section, it will be linked
next to the .switch sections from your code. This is fine, except it is taking
up that valuable on-chip memory. So, you may want to check for this situation
occasionally by looking at the linker map file you create with the linker -m
option.


5.2.5 Example Compiler Invocation 


A typical build could look like:


cl6x -mr1 <other options> <C files> -z -o app.out
     -m app.map farlnk.cmd


In this one step you both compile all the C files and link them together. The
’C6x executable image file is named app.out and the linker map file is named
app.map.


Refer to section 4.4.1 to learn about the linker error messages when calls go 
beyond the PC relative boundary. 


5.2.6 Header File Details 


Look at the file linkage.h in the \include directory of the release. Depending
on the value of the _FAR_RTS macro, the macro _CODE_ACCESS is set to force
calls to RTS functions to be either user default, near, or far. The _FAR_RTS
macro is set according to the use of the -mr<n> switch.


Table 5-3. How _FAR_RTS is Defined in Linkage.h With -mr


Option     Internal RTS calls    Normal RTS calls    _FAR_RTS
Default    Same as user          Same as user        Undefined
-mr0       Near                  Near                0
-mr1       Far                   Far                 1


The _DATA_ACCESS macro is set to always be far.


The _IDECL macro determines how inline functions are declared.


All of the RTS header files which define functions or data include the
linkage.h header file. Functions are modified with _CODE_ACCESS:


extern _CODE_ACCESS void exit(int _status);


and data is modified with _DATA_ACCESS:


extern _DATA_ACCESS unsigned char _ctypes_[];


5.2.7 Changing RTS Data to near 


If for some reason you do not want accesses of RTS data to use the far access 
method, take these steps: 


1) Go to the \include directory of the release.

2) Edit linkage.h, and change the

      #define _DATA_ACCESS far

   macro to

      #define _DATA_ACCESS near

   to force all access of RTS data to use near access, or change it to

      #define _DATA_ACCESS

   if you want RTS data access to use the same method used when accessing
   ordinary user data.

3) Copy linkage.h to the \lib directory.

4) Go to the \lib directory.

5) Replace the linkage.h entry in the source library:

      ar6x -r rts.src linkage.h

6) Delete linkage.h.

7) Rename or delete the object library you use when linking.

8) Rebuild the object library you use with the library build command listed in
   the readme file for that release.


Note that you will have to perform this process each time you install an update 
of the code generation toolset. 
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Structure of Assembly Code 


An assembly language program must be an ASCII text file. Any line of 
assembly code can include up to seven items: 


- Label
- Parallel bars
- Conditions
- Instruction
- Functional unit
- Operands
- Comment
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6.1 Labels 


A label identifies a line of code or a variable and represents a memory address 
that contains either an instruction or data. 


Figure 6-1 shows the position of the label in a line of assembly code. The colon
following the label is optional.


Figure 6-1. Labels in Assembly Code


label: parallel bars [condition] instruction unit operands ; comments


Labels must meet the following conditions:


- The first character of a label must be a letter or an underscore (_) followed
  by a letter.

- The first character of the label must be in the first column of the text file.

- Labels can include up to 32 alphanumeric characters.


6.2 Parallel Bars


An instruction that executes in parallel with the previous instruction signifies
this with parallel bars (||). This field is left blank for an instruction that does not
execute in parallel with the previous instruction.


Figure 6-2. Parallel Bars in Assembly Code


label: parallel bars [condition] instruction unit operands ; comments


6.3 Conditions 


Five registers in the ’C6x are available for conditions: A1, A2, B0, B1, and B2.
Figure 6-3 shows the position of a condition in a line of assembly code.


Figure 6-3. Conditions in Assembly Code


label: parallel bars [condition] instruction unit operands ; comments


All ’C6x instructions are conditional:


- If no condition is specified, the instruction is always performed.

- If a condition is specified and that condition is true, the instruction
  executes. For example:

     With this condition...    The instruction executes if...
     [A1]                      A1 != 0
     [!A1]                     A1 = 0

- If a condition is specified and that condition is false, the instruction does
  not execute.

     With this condition...    The instruction does not execute if...
     [A1]                      A1 = 0
     [!A1]                     A1 != 0
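

The three rules above can be sketched in C. The Condition type and the
function instruction_executes are hypothetical names invented for this
illustration; they only model the decision described in the list and are not
part of any TI tool or header.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the condition field of one instruction. */
typedef struct {
    bool    has_cond;   /* false: no condition was written          */
    bool    negate;     /* true for [!reg], false for [reg]         */
    int32_t reg_value;  /* current value of the condition register  */
} Condition;

/* Returns true if an instruction guarded by c would execute. */
bool instruction_executes(Condition c)
{
    if (!c.has_cond)
        return true;                    /* unconditional: always performed */

    bool nonzero = (c.reg_value != 0);  /* [reg] is true when reg != 0     */
    return c.negate ? !nonzero : nonzero;
}
```

For example, a [!A1] condition with A1 equal to 0 yields true, and the same
condition with A1 equal to 5 yields false.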
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Instructions 


Assembly code instructions are either directives or mnemonics: 


- Assembler directives are commands for the assembler (asm6x) that
  control the assembly process or define the data structures (constants and
  variables) in the assembly language program. All assembler directives
  begin with a period, as shown in the partial list in Table 6-1.

- Processor mnemonics are the actual microprocessor instructions that
  execute at runtime and perform the operations in the program. Table 6-2
  summarizes the ’C6x mnemonics. Processor mnemonics must begin in
  column 2 or greater.


Figure 6—4 shows the position of the instruction in a line of assembly code. 


Figure 6-4. Instructions in Assembly Code 


label: parallel bars [condition] instruction unit operands ; comments 


Table 6-1. Selected TMS320C6x Directives


Directives        Description

.sect "name"      Creates section of information (data or code)

.double value     Reserve two consecutive 32 bits (64 bits) in memory and
                  fill with double-precision (64-bit) IEEE floating-point
                  representation of specified value

.float value      Reserve 32 bits in memory and fill with single-precision
                  (32-bit) IEEE floating-point representation of specified
                  value

.int value        Reserve 32 bits in memory and fill with specified value
.long value
.word value

.short value      Reserve 16 bits in memory and fill with specified value
.half value

.byte value       Reserve 8 bits in memory and fill with specified value


See the TMS320C6x Assembly Language Tools User’s Guide for a complete 
list of directives. 


Table 6-2. Selected TMS320C6x Instruction Mnemonics


Category          Mnemonics

Arithmetic        ABS, ADD, ADDA, ADDK, ADDDP†, ADDSP†, ADD2, DPINT†, DPSP†,
                  DPTRUNC†, INTDP†, INTSP†, RCPDP†, RCPSP†, RSQRDP†, RSQRSP†,
                  SADD, SAT, SPDP†, SPINT†, SPTRUNC†, SSUB, SUB, SUBA, SUBC,
                  SUBDP†, SUBSP†, SUB2

Multiply          MPY, MPYDP†, MPYH, MPYHL, MPYI†, MPYID†, MPYLH, MPYSP†, SMPY

Load/Store        LD, LDDW†, MVK, MVKH, ST

Program Control   B, B IRP, B NRP

Bit Management    CLR, EXT, LMBD, NORM, SET

Logical           AND, CMPEQ, CMPEQDP†, CMPEQSP†, CMPGT, CMPGTDP†, CMPGTSP†,
                  CMPLT, CMPLTDP†, CMPLTSP†

Pseudo/Other      IDLE, MV, MVC, NOP, ZERO, NEG, NOT

† ’C67x instruction mnemonics only


See the TMS320C62x/C67x CPU and Instruction Set Reference Guide for a
complete list of instructions.


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6.5 Functional Units 


The ’C6x CPU contains eight functional units, which are shown in Figure 6-5 
and described in Table 6-3. 


Figure 6-5. TMS320C6x Functional Units


Table 6-3. Functional Units and Descriptions


Functional Unit       Description

.L unit (.L1, .L2)    32/40-bit arithmetic and compare operations
                      Leftmost 1, 0, bit counting for 32 bits
                      Normalization count for 32 and 40 bits
                      32-bit logical operations
                      32/64-bit IEEE floating-point arithmetic†
                      Floating-point/fixed-point conversions†

.S unit (.S1, .S2)    32-bit arithmetic operations
                      32/40-bit shifts and 32-bit bit-field operations
                      32-bit logical operations
                      Branching
                      Constant generation
                      Register transfers to/from the control register file
                      32/64-bit IEEE floating-point compare operations†
                      32/64-bit IEEE floating-point reciprocal and square root
                      reciprocal approximation†

.M unit (.M1, .M2)    16 x 16-bit multiplies
                      32 x 32-bit multiplies†
                      Single-precision (32-bit) floating-point IEEE multiplies†
                      Double-precision (64-bit) floating-point IEEE multiplies†

.D unit (.D1, .D2)    32-bit add, subtract, linear and circular address
                      calculation

† ’C67x floating-point devices only


Figure 6-6 shows the position of the unit in a line of assembly code.


Figure 6-6. Units in the Assembly Code


label: parallel bars [condition] instruction unit operands ; comments


Specifying the functional unit in the assembly code is optional. The functional 
unit can be used to document which resource(s) each instruction uses. 
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6.6 Operands 


The ’C6x architecture requires that memory reads and writes move data 
between memory and a register. Figure 6—7 shows the position of the oper- 
ands in a line of assembly code. 


Figure 6—7. Operands in the Assembly Code 


label: parallel bars [condition] instruction unit operands  ; comments 


Instructions have the following requirements for operands in the assembly
code:


- All instructions require a destination operand.

- Most instructions require one or two source operands.

- The destination operand must be in the same register file as one source
  operand.

- One source operand from each register file per execute packet can come
  from the register file opposite that of the other source operand.


When an operand comes from the other register file, the unit includes an X,
as shown in Figure 6-8, indicating that the instruction is using one of the
cross paths. (See the TMS320C6x CPU and Instruction Set Reference
Guide for more information on cross paths.)


Figure 6-8. Operands in Instructions


ADD    .L1     A0,A1,A3

ADD    .L1X    A0,B1,A3

All registers except B1 are on the same side of the CPU.


The ’C6x instructions use three types of operands to access data:

- Register operands indicate a register that contains the data.

- Constant operands specify the data within the assembly code.

- Pointer operands contain addresses of data values.

Only the load and store instructions require and use pointer operands to
move data values between memory and a register.




6.7 Comments 


As with all programming languages, comments provide code documentation. 
Figure 6-9 shows the position of the comment in a line of assembly code. 


Figure 6-9. Comments in Assembly Code 


label: parallelbars [condition] instruction unit operands ; comments 


The following are guidelines for using comments in assembly code: 


- A comment may begin in any column when preceded by a semicolon (;).
- A comment must begin in the first column when preceded by an asterisk (*).
- Comments are not required but are recommended.
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Chapter 7 


Optimizing Assembly Code 
via Linear Assembly 


This chapter describes methods that help you develop more efficient 
assembly language programs, understand the code produced by the 
assembly optimizer, and perform manual optimization. 


This chapter encompasses phase 3 of the code development flow. After you 
have developed and optimized your C code using the ‘C6x compiler, extract 
the inefficient areas from your C code and rewrite them in linear assembly (as- 
sembly code that has not been register-allocated and is unscheduled). 


The assembly code shown in this chapter has been hand-optimized in order
to direct your attention to particular coding issues. The actual output from the
assembly optimizer may look different, depending on the version you are
using.


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7.1 Assembly Code 




The source that you write for the assembly optimizer is similar to assembly 
source code; however, linear assembly does not include information about 
parallel instructions, instruction latencies, or register usage. The assembly op- 
timizer takes care of the difficulties of streamlining your code by: 


- Finding instructions that can be executed in parallel
- Handling pipeline latencies during software pipelining
- Assigning register usage
- Defining which unit to use


Although you have the option with the 'C6x to specify the functional unit or reg- 
ister used, this may restrict the compiler’s ability to fully optimize your code. 
See the TMS320C6x Optimizing C Compiler User’s Guide for more informa- 
tion. 


This chapter takes you through the optimization process manually to show you 
how the assembly optimizer works and to help you understand when you might 
want to perform some of the optimizations manually. Each section introduces 
optimization techniques in increasing complexity: 


- Section 7.3 and section 7.4 begin with a dot product algorithm to show you
  how to translate the C code to assembly code and then how to optimize
  the linear assembly code with several simple techniques.

- Section 7.5 and section 7.6 introduce techniques for the more complex
  algorithms associated with software pipelining, such as modulo iteration
  interval scheduling for both single-cycle loops and multicycle loops.

- Section 7.7 uses an IIR filter algorithm to discuss the problems with loop
  carry paths.

- Section 7.8 and section 7.9 discuss the problems encountered with
  if-then-else statements in a loop and how loop unrolling can be used to
  resolve them.

- Section 7.10 introduces live-too-long issues in your code.

- Section 7.11 uses a simple FIR filter algorithm to discuss redundant load
  elimination.

- Section 7.12 discusses the same FIR filter in terms of the interleaved
  memory bank scheme used by ’C6x devices.

- Section 7.13 and section 7.14 show you how to execute the outer loop of
  the FIR filter conditionally and in parallel with the inner loop.


Each example discusses the:

- Algorithm in C code
- Translation of the C code to linear assembly
- Dependency graph to describe the flow of data in the algorithm
- Allocation of resources (functional units, registers, and cross paths) in
  linear assembly


Note:

There are three types of code for the ’C6x: C code (which is input for the C
compiler), linear assembly code (which is input for the assembly optimizer),
and assembly code (which is input for the assembler).


In the three sections following section 7.2, we use the dot product to demon- 
strate how to use various programming techniques to optimize both perfor- 
mance and code size. Most of the examples provided in this book use fixed- 
point arithmetic; however, the three sections following section 7.2 give both 
fixed-point and floating-point examples of the dot product to show that the 
same optimization techniques apply to both fixed- and floating-point
programs.


7.2 Assembly Optimizer Options and Directives 


All directives and options that are described in the following sections are listed 
in greater detail in Chapter 4 of the TMS320C6x Optimizing C Compiler User’s 
Guide. 


7.2.1 The -mt Option and the .no_mdep Directive


Because the assembly optimizer has no idea where objects you are accessing 
are located when you perform load and store instructions, the assembly opti- 
mizer is by default very conservative in determining dependencies between 
memory operations. For example, let us say you have the following loop de- 
fined in a linear assembly file: 


Example 7-1. Linear Assembly Block Copy 


loop:
         ldw     *reg1++, reg2
         add     reg2, reg3, reg4
         stw     reg4, *reg5++
 [reg6]  add     -1, reg6, reg6
 [reg6]  b       loop


The assembly optimizer makes sure that each store through reg5 completes
before the next load through reg1. The resulting loop is suboptimal whenever
the store through reg5 does not write the next location to be read through
reg1. For loops where reg5 points to the next location to be read through
reg1, this ordering is necessary and implies that the loop has a loop carry
path (refer to section 7.7, Loop Carry Paths, for more information). For most
loops, this is not the case, and you can tell the assembly optimizer to be
more aggressive about scheduling memory operations. You can do this either by
including the .no_mdep (no memory dependencies) directive in your linear
assembly function or with the -mt option when you are compiling the linear
assembly file. Be aware that if you are compiling both C code and linear
assembly code in your application, the -mt option has different meanings for
C and for linear assembly code. In this case, use the .no_mdep directive in
your linear assembly source files. For a full description of the implications
of .no_mdep and the -mt option, refer to Appendix A, Memory Alias
Disambiguation. Refer to Chapter 4 of the Optimizing C Compiler User’s Guide
for more information on both the -mt option and the .no_mdep directive.
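

The effect of this ordering can be sketched in C. The two functions below are
hypothetical models of the block copy in Example 7-1, written only for this
illustration: the first keeps every store ahead of the next load, and the
second issues the next load before the current store, as a scheduler might
once it has been told there are no memory dependences. This is not the
assembly optimizer's actual algorithm.

```c
#include <stddef.h>

/* Conservative order: the store of iteration i completes before the load
   of iteration i + 1. */
void add_copy_in_order(const int *src, int *dst, int k, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] + k;
}

/* Aggressive order: the load for iteration i + 1 is issued before the
   store of iteration i. Only safe when dst never aliases later src data. */
void add_copy_loads_hoisted(const int *src, int *dst, int k, size_t n)
{
    if (n == 0)
        return;

    int next = src[0];
    for (size_t i = 0; i < n; i++) {
        int cur = next;
        if (i + 1 < n)
            next = src[i + 1];   /* load hoisted above the store below */
        dst[i] = cur + k;
    }
}
```

When the two buffers do not overlap, both functions produce the same result.
When dst points one element past src, the stored value is the next value to
be loaded, so the hoisted version reads stale data and the results differ.
That is the loop carry case in which .no_mdep or -mt must not be used.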


7.2.2 The .mdep Directive


Should you need to specify a dependence between two or more memory
references, use the .mdep directive. Annotate your code with memory reference
symbols and add the .mdep directive to your linear assembly function.


Example 7-2. Block Copy With .mdep


        .mdep   ld1, st1

        LDW     *p1++ {ld1}, inp1     ; annotate memory reference ld1
        ; other code ...
        STW     outp2, *p2++ {st1}    ; annotate memory reference st1


The .mdep directive indicates there is a memory dependence from the LDW
instruction to the STW instruction. This means that the STW instruction must
come after the LDW instruction. The .mdep directive does not imply that there
is a memory dependence from the STW to the LDW. Another .mdep directive
would be needed to handle that case.


7.2.3 The .mptr Directive 


The .mptr directive gives the assembly optimizer information on how to avoid
memory bank conflicts. The assembly optimizer will rearrange the memory
references generated in the assembly code to avoid the memory bank conflicts
that were specified with the .mptr directive. This means that code generated
by the assembly optimizer will be faster by avoiding the memory bank conflicts.
Example 7-3 shows linear assembly code and the generated loop kernel for
a dot product without the .mptr directive.


Example 7-3. Linear Assembly Dot Product


dotp:   .cproc  ptr_a, ptr_b, cnt
        .reg    val1, val2, val3, val4
        .reg    prod1, prod2, sum1, sum2
        zero    sum1
        zero    sum2
loop:   .trip   20, 20
        ldh     *ptr_a++, val1
        ldh     *ptr_b++, val2
        mpy     val1, val2, prod1
        add     sum1, prod1, sum1
        ldh     *ptr_a++, val3
        ldh     *ptr_b++, val4
        mpy     val3, val4, prod2
        add     sum2, prod2, sum2
 [cnt]  add     -1, cnt, cnt
 [cnt]  b       loop
        add     sum1, sum2, sum1
        return  sum1
        .endproc


<loop kernel generated>

                ; PIPED LOOP KERNEL
        ADD     .L2     B4,B6,B4
        MPY     .M2X    B7,A0,B6
        B       .S1     loop
        LDH     .D2T2   *-B5(2),B6
        LDH     .D1T1   *-A4(2),A0
        SUB     .S1     A1,1,A1
        ADD     .L1     A5,A3,A5
        MPY     .M1X    B6,A0,A3
        ADD     .L2     -1,B0,B0
        LDH     .D2T2   *B5++(4),B7
        LDH     .D1T1   *A4++(4),A0


If the arrays pointed to by ptr_a and ptr_b begin on the same bank, then there
will be memory bank conflicts at every cycle of the loop due to how the LDH
instructions are paired.


By adding the .mptr directive information, you can avoid the memory bank
conflicts. Example 7-4 shows the linear assembly dot product with the .mptr
directive and the resulting loop kernel.


Example 7-4. Linear Assembly Dot Product With .mptr


dotp:   .cproc  ptr_a, ptr_b, cnt
        .reg    val1, val2, val3, val4
        .reg    prod1, prod2, sum1, sum2
        zero    sum1
        zero    sum2
        .mptr   ptr_a, x, 4
        .mptr   ptr_b, x, 4
loop:   .trip   20, 20
        ldh     *ptr_a++, val1
        ldh     *ptr_b++, val2
        mpy     val1, val2, prod1
        add     sum1, prod1, sum1
        ldh     *ptr_a++, val3
        ldh     *ptr_b++, val4
        mpy     val3, val4, prod2
        add     sum2, prod2, sum2
 [cnt]  add     -1, cnt, cnt
 [cnt]  b       loop
        add     sum1, sum2, sum1
        return  sum1
        .endproc


<loop kernel generated>

loop:           ; PIPED LOOP KERNEL
 [!A1]  ADD     .L2     B4,B6,B4
        MPY     .M2X    B8,A0,B6
 [ B0]  B       .S1     loop
        LDH     .D2T2   *B5++(4),B8
        LDH     .D1T1   *-A4(2),A0
 [ A1]  SUB     .S1     A1,1,A1
 [!A1]  ADD     .L1     A5,A3,A5
        MPY     .M1X    B7,A0,A3
 [ B0]  ADD     .L2     -1,B0,B0
        LDH     .D2T2   *-B5(2),B7
        LDH     .D1T1   *A4++(4),A0


The above loop kernel has no memory bank conflicts in the case where ptr_a
and ptr_b point to the same bank. This means that you have to know how your
data is aligned in C code before using the .mptr directive in your linear
assembly code. The ’C6x compiler supports pragmas in C that align your data to
a particular boundary (DATA_ALIGN, for example). Use these pragmas to align
your data properly, so that the .mptr directives work in your linear assembly
code.
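

As a rough model of what "same bank" means here, the C sketch below maps a
byte address to a bank number and reports whether two halfword loads issued
in the same cycle would collide. The bank count (four) and bank width (16
bits), as well as the names bank_of and halfword_loads_conflict, are
assumptions chosen for illustration; section 7.12 describes the interleaved
memory bank scheme actually used by ’C6x devices.

```c
#include <stdbool.h>
#include <stdint.h>

#define BANK_WIDTH_BYTES 2u   /* assumption: each bank is 16 bits wide  */
#define NUM_BANKS        4u   /* assumption: four interleaved banks     */

/* Bank selected by a byte address under the assumed interleaving. */
unsigned bank_of(uint32_t byte_addr)
{
    return (unsigned)((byte_addr / BANK_WIDTH_BYTES) % NUM_BANKS);
}

/* True if two halfword loads issued in the same cycle hit the same bank. */
bool halfword_loads_conflict(uint32_t addr_a, uint32_t addr_b)
{
    return bank_of(addr_a) == bank_of(addr_b);
}
```

Under this model, two arrays that start on the same bank collide whenever
they are read at equal offsets in the same cycle, and they stop colliding
once one of the two paired loads is offset by a halfword.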


7.2.4 The .trip Directive


The .trip directive is analogous to the _nassert intrinsic for C. The .trip
directive looks like:


label: .trip minimum_value[, maximum_value[, factor]]


For example if you wanted to say that the linear assembly loop will execute 
some minimum number of times, use the .trip directive with just the first para- 
meter. This example tells the assembly optimizer that the loop will iterate at 
least ten times. 


loop: .trip 10 


You can also tell the assembly optimizer that your loop will execute exactly 
some number of times by setting the minimum_value and maximum_value pa- 
rameters to exactly the same value. This next example tells the assembly opti- 
mizer that the loop will iterate exactly 20 times. 


loop: .trip 20, 20 


The maximum_value parameter can also tell the assembly optimizer that the
loop will iterate a number of times within some range. The factor parameter
tells the assembly optimizer that the number of iterations is always a
multiple of factor. For example, the next loop will iterate either 8, 16, 24,
32, 40, or 48 times when this particular linear assembly loop is called.


loop: .trip 8, 48, 8 


The maximum_value and factor parameters are especially useful when your 
loop needs to be interruptible. Refer to section 8.4.4, Getting the Most Perfor- 
mance Out of Interruptible Code. 
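

The meaning of the three .trip parameters can be expressed as a small C
predicate. The function name trip_count_allowed is hypothetical and exists
only for this sketch; it states which iteration counts a given .trip
directive promises, following the description above.

```c
#include <stdbool.h>

/* Returns true if a loop that runs `count` times is consistent with
   ".trip min, max, factor". A factor of 1 means "no multiple constraint". */
bool trip_count_allowed(unsigned count, unsigned min, unsigned max,
                        unsigned factor)
{
    if (factor == 0)
        return false;               /* a zero factor is not meaningful */

    return count >= min && count <= max && (count % factor) == 0;
}
```

For ".trip 8, 48, 8" the predicate accepts 8, 16, 24, 32, 40, and 48, and
rejects counts such as 12 or 56.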
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7.3 Writing Parallel Code 


One way to optimize linear assembly code is to reduce the number of execu- 
tion cycles in a loop. You can do this by rewriting linear assembly instructions 
so that the final assembly instructions execute in parallel. 


7.3.1. Dot Product C Code 


The dot product is a sum in which each element in array a is multiplied by the
corresponding element in array b. Each of these products is then accumulated
into sum. The C code in Example 7-5 is a fixed-point dot product algorithm.
The C code in Example 7-6 is a floating-point dot product algorithm.


Example 7-5. Fixed-Point Dot Product C Code


int dotp(short a[], short b[])
{
    int sum, i;
    sum = 0;

    for (i = 0; i < 100; i++)
        sum += a[i] * b[i];

    return (sum);
}


Example 7-6. Floating-Point Dot Product C Code


float dotp(float a[], float b[])
{
    int i;
    float sum;
    sum = 0;
    for (i = 0; i < 100; i++)
        sum += a[i] * b[i];

    return (sum);
}


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7.3.2 Translating C Code to Linear Assembly 


The first step in optimizing your code is to translate the C code to linear assem- 
bly. 


7.3.2.1 Fixed-Point Dot Product 


Example 7-7 shows the linear assembly instructions used for the inner loop
of the fixed-point dot product C code.


Example 7-7. List of Assembly Instructions for Fixed-Point Dot Product


        LDH     .D1     *A4++,A2        ; load ai from memory
        LDH     .D1     *A3++,A5        ; load bi from memory
        MPY     .M1     A2,A5,A6        ; ai * bi
        ADD     .L1     A6,A7,A7        ; sum += (ai * bi)
        SUB     .S1     A1,1,A1         ; decrement loop counter
 [A1]   B       .S2     LOOP            ; branch to loop

The load halfword (LDH) instructions increment through the a and b arrays. 
Each LDH does a postincrement on the pointer. Each iteration of these instruc- 
tions sets the pointer to the next halfword (16 bits) in the array. The ADD in- 
struction accumulates the total of the results from the multiply (MPY) instruc- 
tion. The subtract (SUB) instruction decrements the loop counter. 


An additional instruction is included to execute the branch back to the top of 
the loop. The branch (B) instruction is conditional on the loop counter, A1, and 
executes only until A1 is 0. 
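

To connect the instruction list back to the C source, the following sketch
rewrites the inner loop in C with one statement per instruction. Variable
names mirror the registers in Example 7-7; the function name dotp_linear and
the requirement that the counter is at least 1 on entry are assumptions of
this sketch, and instruction latencies are not modeled.

```c
/* C model of the instruction list in Example 7-7.
   A4, A3: pointers into the a and b arrays; A1: loop counter (>= 1). */
int dotp_linear(const short *A4, const short *A3, int A1)
{
    int A7 = 0;                 /* running sum                       */

    do {
        short A2 = *A4++;       /* LDH  *A4++,A2  ; load ai          */
        short A5 = *A3++;       /* LDH  *A3++,A5  ; load bi          */
        int   A6 = A2 * A5;     /* MPY  A2,A5,A6  ; ai * bi          */
        A7 += A6;               /* ADD  A6,A7,A7  ; sum += ai * bi   */
        A1 -= 1;                /* SUB  A1,1,A1   ; decrement count  */
    } while (A1 != 0);          /* [A1] B LOOP    ; branch if A1 != 0 */

    return A7;
}
```

Called with the arrays {1, 2, 3, 4} and {5, 6, 7, 8} and a count of 4, the
function returns 70, the same value the C loop in Example 7-5 would compute
over four elements.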


7.3.2.2 Floating-Point Dot Product 


Example 7-8 shows the linear assembly instructions used for the inner loop
of the floating-point dot product C code.


Example 7-8. List of Assembly Instructions for Floating-Point Dot Product


        LDW     .D1     *A4++,A2        ; load ai from memory
        LDW     .D1     *A3++,A5        ; load bi from memory
        MPYSP†  .M1     A2,A5,A6        ; ai * bi
        ADDSP†  .L1     A6,A7,A7        ; sum += (ai * bi)
        SUB     .S1     A1,1,A1         ; decrement loop counter
 [A1]   B       .S2     LOOP            ; branch to loop


† ADDSP and MPYSP are ’C67x (floating-point) instructions only.


The load word (LDW) instructions increment through the a and b arrays. Each
LDW does a postincrement on the pointer. Each iteration of these instructions
sets the pointer to the next word (32 bits) in the array. The ADDSP instruction
accumulates the total of the results from the multiply (MPYSP) instruction. The
subtract (SUB) instruction decrements the loop counter.


An additional instruction is included to execute the branch back to the top of
the loop. The branch (B) instruction is conditional on the loop counter, A1, and
executes only until A1 is 0.


7.3.3 Linear Assembly Resource Allocation 


The following rules affect the assignment of functional units for Example 7-7
and Example 7-8 (shown in the third column of each example):


- Load (LDH and LDW) instructions must use a .D unit.
- Multiply (MPY and MPYSP) instructions must use a .M unit.
- Add (ADD and ADDSP) instructions use a .L unit.
- Subtract (SUB) instructions use a .S unit.
- Branch (B) instructions must use a .S unit.


Note:

The ADD and SUB can be on the .S, .L, or .D units; however, for Example 7-7
and Example 7-8, they are assigned as listed above.

The ADDSP instruction in Example 7-8 must use a .L unit.


7.3.4 Drawing a Dependency Graph 


Dependency graphs can help analyze loops by showing the flow of instruc- 
tions and data in an algorithm. These graphs also show how instructions 
depend on one another. The following terms are used in defining a depen- 
dency graph. 


- A node is a point on a dependency graph with one or more data paths
  flowing in and/or out.

- The path shows the flow of data between nodes. The numbers beside
  each path represent the number of cycles required to complete the
  instruction.

- An instruction that writes to a variable is referred to as a parent instruction
  and defines a parent node.

- An instruction that reads a variable written by a parent instruction is
  referred to as its child and defines a child node.

Use the following steps to draw a dependency graph:


1) Define the nodes based on the variables accessed by the instructions.
2) Define the data paths that show the flow of data between nodes.
3) Add the instructions and the latencies.
4) Add the functional units.


7.3.4.1. Fixed-Point Dot Product 


Figure 7-1 shows the dependency graph for the fixed-point dot product
assembly instructions shown in Example 7-7 and their corresponding register
allocations.


Figure 7-1. Dependency Graph of Fixed-Point Dot Product 

- The two LDH instructions, which write the values of ai and bi, are parents
  of the MPY instruction. It takes five cycles for the parent (LDH) instruction
  to complete. Therefore, if LDH is scheduled on cycle i, then its child (MPY)
  cannot be scheduled until cycle i + 5.

- The MPY instruction, which writes the product pi, is the parent of the ADD
  instruction. The MPY instruction takes two cycles to complete.

- The ADD instruction adds pi (the result of the MPY) to sum. The output of
  the ADD instruction feeds back to become an input on the next iteration
  and, thus, creates a loop carry path. (See section 7.7 for more information
  on loop carry paths.)

The dependency graph for this dot product algorithm has two separate parts
because the decrement of the loop counter and the branch do not read or write
any variables from the other part.

- The SUB instruction writes to the loop counter, cntr. The output of the SUB
  instruction feeds back and creates a loop carry path.

- The branch (B) instruction is a child of the loop counter.


7.3.4.2 Floating-Point Dot Product 


Similarly, Figure 7-2 shows the dependency graph for the floating-point dot
product assembly instructions shown in Example 7-8 and their corresponding
register allocations.


Figure 7-2. Dependency Graph of Floating-Point Dot Product 

- The two LDW instructions, which write the values of ai and bi, are parents
  of the MPYSP instruction. It takes five cycles for the parent (LDW)
  instruction to complete. Therefore, if LDW is scheduled on cycle i, then its
  child (MPYSP) cannot be scheduled until cycle i + 5.

- The MPYSP instruction, which writes the product pi, is the parent of the
  ADDSP instruction. The MPYSP instruction takes four cycles to complete.

- The ADDSP instruction adds pi (the result of the MPYSP) to sum. The
  output of the ADDSP instruction feeds back to become an input on the next
  iteration and, thus, creates a loop carry path. (See section 7.7 for more
  information on loop carry paths.)

The dependency graph for this dot product algorithm has two separate parts 
because the decrement of the loop counter and the branch do not read or write 
any variables from the other part. 


- The SUB instruction writes to the loop counter, cntr. The output of the SUB
  instruction feeds back and creates a loop carry path.

- The branch (B) instruction is a child of the loop counter.
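
The parent and child relationships described in this section can be turned into a small calculation. The following C sketch is not part of the ’C6x tools; the names dep_edge, earliest_cycles, fixed_point_edges, and floating_point_edges are assumptions made for illustration. It walks the data paths of one iteration (loop carry paths are deliberately left out) and reports the earliest cycle on which each child can be scheduled, using the latencies quoted above: five cycles for a load, two for MPY, and four for MPYSP.

```c
#include <assert.h>
#include <stddef.h>

/* One data path of a dependency graph: child reads a value written by parent. */
struct dep_edge {
    int parent;   /* index of the parent node */
    int child;    /* index of the child node */
    int latency;  /* cycles the parent needs before the child may issue */
};

/*
 * Compute the earliest issue cycle of every node for a single iteration.
 * Assumption: edges are listed so that every edge into a node appears
 * before any edge leaving that node. Loop carry paths are not modeled.
 */
void earliest_cycles(const struct dep_edge *edges, size_t n_edges,
                     int *cycle, size_t n_nodes)
{
    size_t i;

    for (i = 0; i < n_nodes; i++)
        cycle[i] = 0;

    for (i = 0; i < n_edges; i++) {
        int ready;

        assert(edges[i].parent >= 0 && (size_t)edges[i].parent < n_nodes);
        assert(edges[i].child >= 0 && (size_t)edges[i].child < n_nodes);

        ready = cycle[edges[i].parent] + edges[i].latency;
        if (ready > cycle[edges[i].child])
            cycle[edges[i].child] = ready;
    }
}

/* Nodes: 0 = load ai, 1 = load bi, 2 = multiply, 3 = add */
static const struct dep_edge fixed_point_edges[] = {
    {0, 2, 5}, {1, 2, 5},   /* LDH -> MPY, five cycles */
    {2, 3, 2},              /* MPY -> ADD, two cycles  */
};

static const struct dep_edge floating_point_edges[] = {
    {0, 2, 5}, {1, 2, 5},   /* LDW -> MPYSP, five cycles   */
    {2, 3, 4},              /* MPYSP -> ADDSP, four cycles */
};
```

With the loads scheduled on cycle 0, the sketch places MPY on cycle 5 and ADD on cycle 7 for the fixed-point loop, and MPYSP on cycle 5 and ADDSP on cycle 9 for the floating-point loop.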



7.3.5 Nonparallel Versus Parallel Assembly Code 


Nonparallel assembly code is performed serially, that is, one instruction follow- 
ing another in sequence. This section explains how to rewrite the instructions 
so that they execute in parallel. 


7.3.5.1 Fixed-Point Dot Product 


Example 7-9 shows the nonparallel assembly code for the fixed-point dot 
product loop. The MVK instruction initializes the loop counter to 100. The 
ZERO instruction clears the accumulator. The NOP instructions allow for the 
delay slots of the LDH, MPY, and B instructions. 


Executing this dot product code serially requires 16 cycles for each iteration 
plus two cycles to set up the loop counter and initialize the accumulator; 100 it- 
erations require 1602 cycles. 


Example 7-9. Nonparallel Assembly Code for Fixed-Point Dot Product 


        MVK     .S1     100,A1          ; set up loop counter
        ZERO    .L1     A7              ; zero out accumulator
LOOP:
        LDH     .D1     *A4++,A2        ; load ai from memory
        LDH     .D1     *A3++,A5        ; load bi from memory
        NOP     4                       ; delay slots for LDH
        MPY     .M1     A2,A5,A6        ; ai * bi
        NOP                             ; delay slot for MPY
        ADD     .L1     A6,A7,A7        ; sum += (ai * bi)
        SUB     .S1     A1,1,A1         ; decrement loop counter
[A1]    B       .S2     LOOP            ; branch to loop
        NOP     5                       ; delay slots for branch
                                        ; Branch occurs here


Assigning the same functional unit to both LDH instructions slows perfor- 
mance of this loop. Therefore, reassign the functional units to execute the 
code in parallel, as shown in the dependency graph in Figure 7-3. The parallel 
assembly code is shown in Example 7-10. 

Figure 7-3. Dependency Graph of Fixed-Point Dot Product with Parallel Assembly


Example 7-10. Parallel Assembly Code for Fixed-Point Dot Product


        MVK     .S1     100,A1          ; set up loop counter
||      ZERO    .L1     A7              ; zero out accumulator
LOOP:
        LDH     .D1     *A4++,A2        ; load ai from memory
||      LDH     .D2     *B4++,B2        ; load bi from memory
        SUB     .S1     A1,1,A1         ; decrement loop counter
[A1]    B       .S2     LOOP            ; branch to loop
        NOP     2                       ; delay slots for LDH
        MPY     .M1X    A2,B2,A6        ; ai * bi
        NOP                             ; delay slots for MPY
        ADD     .L1     A6,A7,A7        ; sum += (ai * bi)
                                        ; Branch occurs here


Because the loads of ai and bi do not depend on one another, both LDH 
instructions can execute in parallel as long as they do not share the same 
resources. To schedule the load instructions in parallel, allocate the functional 
units as follows: 


- ai and the pointer to ai to a functional unit on the A side, .D1
- bi and the pointer to bi to a functional unit on the B side, .D2


Because the MPY instruction now has one source operand from A and one 
from B, MPY uses the 1X cross path. 



Rearranging the order of the instructions also improves the performance of the 
code. The SUB instruction can take the place of one of the NOP delay slots 
for the LDH instructions. Moving the B instruction after the SUB removes the 
need for the NOP 5 used at the end of the code in Example 7-9. 


The branch now occurs immediately after the ADD instruction so that the MPY 
and ADD execute in parallel with the five delay slots required by the branch 
instruction. 
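
The per-iteration cycle counts quoted for Example 7-9 and Example 7-10 can be checked by adding up the cycles that each line of the loop occupies. The following C sketch is only a bookkeeping aid and is not produced by any ’C6x tool; the array and function names are assumptions made for illustration. It assumes that each instruction, or each group of instructions joined by ||, occupies one cycle and that NOP n occupies n cycles.

```c
#include <assert.h>
#include <stddef.h>

/* Cycles occupied by each line of the nonparallel loop (Example 7-9). */
static const int nonparallel_loop[] = {
    1, /* LDH          */
    1, /* LDH          */
    4, /* NOP 4        */
    1, /* MPY          */
    1, /* NOP          */
    1, /* ADD          */
    1, /* SUB          */
    1, /* B            */
    5, /* NOP 5        */
};

/* Cycles occupied by each line of the parallel loop (Example 7-10). */
static const int parallel_loop[] = {
    1, /* LDH || LDH   */
    1, /* SUB          */
    1, /* B            */
    2, /* NOP 2        */
    1, /* MPY          */
    1, /* NOP          */
    1, /* ADD          */
};

/* Sum the cycles of one loop iteration. */
int cycles_per_iteration(const int *cycles, size_t n)
{
    int total = 0;
    size_t i;

    assert(cycles != NULL);
    for (i = 0; i < n; i++)
        total += cycles[i];
    return total;
}
```

For these two listings, cycles_per_iteration() returns 16 for the nonparallel loop and 8 for the parallel loop. In the parallel loop, the five cycles that follow B (NOP 2, MPY, NOP, ADD) fill the branch delay slots, which is why no NOP 5 is needed.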


7.3.5.2 Floating-Point Dot Product 


Similarly, Example 7-11 shows the nonparallel assembly code for the floating- 
point dot product loop. The MVK instruction initializes the loop counter to 100. 
The ZERO instruction clears the accumulator. The NOP instructions allow for 
the delay slots of the LDW, ADDSP, MPYSP, and B instructions. 


Executing this dot product code serially requires 21 cycles for each iteration 
plus two cycles to set up the loop counter and initialize the accumulator; 100 it- 
erations require 2102 cycles. 


Example 7-11. Nonparallel Assembly Code for Floating-Point Dot Product 


        MVK     .S1     100,A1          ; set up loop counter
        ZERO    .L1     A7              ; zero out accumulator
LOOP:
        LDW     .D1     *A4++,A2        ; load ai from memory
        LDW     .D1     *A3++,A5        ; load bi from memory
        NOP     4                       ; delay slots for LDW
        MPYSP   .M1     A2,A5,A6        ; ai * bi
        NOP     3                       ; delay slots for MPYSP
        ADDSP   .L1     A6,A7,A7        ; sum += (ai * bi)
        NOP     3                       ; delay slots for ADDSP
        SUB     .S1     A1,1,A1         ; decrement loop counter
[A1]    B       .S2     LOOP            ; branch to loop
        NOP     5                       ; delay slots for branch
                                        ; Branch occurs here


Assigning the same functional unit to both LDW instructions slows the
performance of this loop. Therefore, reassign the functional units to execute
the code in parallel, as shown in the dependency graph in Figure 7-4. The
parallel assembly code is shown in Example 7-12.


Figure 7-4. Dependency Graph of Floating-Point Dot Product with Parallel Assembly


Example 7-12. Parallel Assembly Code for Floating-Point Dot Product


Because the loads of ai and bi do not depend on one another, both LDW
instructions can execute in parallel as long as they do not share the same
resources. To schedule the load instructions in parallel, allocate the functional
units as follows:


- ai and the pointer to ai to a functional unit on the A side, .D1
- bi and the pointer to bi to a functional unit on the B side, .D2


Because the MPYSP instruction now has one source operand from A and one
from B, MPYSP uses the 1X cross path.



Rearranging the order of the instructions also improves the performance of the
code. The SUB instruction replaces one of the NOP delay slots for the LDW
instructions. Moving the B instruction after the SUB removes the need for the
NOP 5 used at the end of the code in Example 7-11.


The branch now occurs immediately after the ADDSP instruction so that the 
MPYSP and ADDSP execute in parallel with the five delay slots required by 
the branch instruction. 


Since the ADDSP finishes execution before the result is needed, the NOP 3
for delay slots is removed, further reducing cycle count.


7.3.6 Comparing Performance


Executing the fixed-point dot product code in Example 7-10 requires eight
cycles for each iteration plus one cycle to set up the loop counter and initialize
the accumulator; 100 iterations require 801 cycles.


Table 7-1 compares the performance of the nonparallel code with the parallel
code for the fixed-point example.


Table 7-1. Comparison of Nonparallel and Parallel Assembly Code for Fixed-Point
Dot Product

Code Example                                                  100 Iterations   Cycle Count
Example 7-9   Fixed-point dot product nonparallel assembly    2 + 100 x 16     1602
Example 7-10  Fixed-point dot product parallel assembly       1 + 100 x 8      801


Executing the floating-point dot product code in Example 7-12 requires ten
cycles for each iteration plus one cycle to set up the loop counter and initialize
the accumulator; 100 iterations require 1001 cycles.


Table 7-2 compares the performance of the nonparallel code with the parallel
code for the floating-point example.


Table 7-2. Comparison of Nonparallel and Parallel Assembly Code for Floating-Point
Dot Product

Code Example                                                    100 Iterations   Cycle Count
Example 7-11  Floating-point dot product nonparallel assembly   2 + 100 x 21     2102
Example 7-12  Floating-point dot product parallel assembly      1 + 100 x 10     1001


7.4 Using Word Access for Short Data and Doubleword Access for Floating-Point Data


The parallel code for the fixed-point example in section 7.3 uses an LDH
instruction to read a[i]. Because a[i] and a[i+1] are next to each other in
memory, you can optimize the code further by using the load word (LDW)
instruction to read a[i] and a[i+1] at the same time and load both into a single
32-bit register. (The data must be word-aligned in memory.)


In the floating-point example, the parallel code uses an LDW instruction to read
a[i]. Because a[i] and a[i+1] are next to each other in memory, you can
optimize the code further by using the load doubleword (LDDW) instruction to
read a[i] and a[i+1] at the same time and load both into a register pair. (The
data must be doubleword-aligned in memory.) See the TMS320C62x/C67x CPU
and Instruction Set User’s Guide for more specific information on the LDDW
instruction.


Note:

The load doubleword (LDDW) instruction is only available on the ’C67x
(floating-point) device.


7.4.1 Unrolled Dot Product C Code


The fixed-point C code in Example 7-13 has the effect of unrolling the loop by
accumulating the even elements, a[i] and b[i], into sum0 and the odd elements,
a[i+1] and b[i+1], into sum1. After the loop, sum0 and sum1 are added to
produce the final sum. The same is true for the floating-point C code in
Example 7-14. (For another example of loop unrolling, see section 7.9.)


Example 7-13. Fixed-Point Dot Product C Code (Unrolled) 


int dotp(short a[], short b[])
{
    int sum0, sum1, sum, i;

    sum0 = 0;
    sum1 = 0;
    for (i = 0; i < 100; i += 2) {
        sum0 += a[i] * b[i];
        sum1 += a[i + 1] * b[i + 1];
    }
    sum = sum0 + sum1;
    return (sum);
}


Example 7-14. Floating-Point Dot Product C Code (Unrolled)


float dotp(float a[], float b[])
{
    int i;
    float sum0, sum1, sum;

    sum0 = 0;
    sum1 = 0;
    for (i = 0; i < 100; i += 2) {
        sum0 += a[i] * b[i];
        sum1 += a[i + 1] * b[i + 1];
    }
    sum = sum0 + sum1;
    return (sum);
}


7.4.2 Translating C Code to Linear Assembly 


The first step in optimizing your code is to translate the C code to linear assem- 
bly. 


7.4.2.1 Fixed-Point Dot Product 


Example 7-15 shows the list of ’C6x instructions that execute the unrolled 
fixed-point dot product loop. Symbolic variable names are used instead of ac- 
tual registers. Using symbolic names for data and pointers makes code easier 
to write and allows the optimizer to allocate registers. However, you must use 
the .reg assembly optimizer directive. See the TMS320C6x Optimizing C 
Compiler User’s Guide for more information on writing linear assembly code. 


Example 7-15. Linear Assembly for Fixed-Point Dot Product Inner Loop with LDW 


        LDW     *a++,ai_i1              ; load ai & ai+1 from memory
        LDW     *b++,bi_i1              ; load bi & bi+1 from memory
        MPY     ai_i1,bi_i1,pi          ; ai * bi
        MPYH    ai_i1,bi_i1,pi1         ; ai+1 * bi+1
        ADD     pi,sum0,sum0            ; sum0 += (ai * bi)
        ADD     pi1,sum1,sum1           ; sum1 += (ai+1 * bi+1)
[cntr]  SUB     cntr,1,cntr             ; decrement loop counter
[cntr]  B       LOOP                    ; branch to loop


The two load word (LDW) instructions load a[i], a[i+1], b[i], and b[i+1] on each
iteration.


Two MPY instructions are now necessary to multiply the second set of array
elements:


- The first MPY instruction multiplies the 16 least significant bits (LSBs) in
  each source register: a[i] x b[i].

- The MPYH instruction multiplies the 16 most significant bits (MSBs) of
  each source register: a[i+1] x b[i+1].


The two ADD instructions accumulate the sums of the even and odd elements:
sum0 and sum1.


Note:

This is true only when the ’C6x is in little-endian mode. In big-endian mode,
MPY operates on a[i+1] and b[i+1] and MPYH operates on a[i] and b[i]. See
the TMS320C62x/C67x Peripherals Reference Guide for more information.
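
The way one word load feeds both multiplies can be modeled in portable C. The sketch below is an illustration only: load_word, mpy_low, mpy_high, and dotp_packed are hypothetical helper names, not ’C6x intrinsics, and the code models the arithmetic rather than the instruction timing. It reads two adjacent 16-bit elements as one 32-bit word, then multiplies the low halves and the high halves separately, as MPY and MPYH do.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Read two adjacent 16-bit elements as one 32-bit word, as LDW would. */
uint32_t load_word(const int16_t *p)
{
    uint32_t w;

    assert(p != NULL);
    memcpy(&w, p, sizeof w);   /* memcpy avoids alignment and aliasing issues in C */
    return w;
}

/* Signed 16 LSBs x signed 16 LSBs, modeling MPY.
 * The narrowing casts assume the usual two's-complement wraparound. */
int32_t mpy_low(uint32_t x, uint32_t y)
{
    return (int32_t)(int16_t)(x & 0xFFFFu) * (int32_t)(int16_t)(y & 0xFFFFu);
}

/* Signed 16 MSBs x signed 16 MSBs, modeling MPYH. */
int32_t mpy_high(uint32_t x, uint32_t y)
{
    return (int32_t)(int16_t)(x >> 16) * (int32_t)(int16_t)(y >> 16);
}

/* Dot product over n elements (n even), two elements per word load. */
int32_t dotp_packed(const int16_t *a, const int16_t *b, int n)
{
    int32_t sum0 = 0, sum1 = 0;
    int i;

    assert(n >= 0 && n % 2 == 0);
    for (i = 0; i < n; i += 2) {
        uint32_t aw = load_word(&a[i]);
        uint32_t bw = load_word(&b[i]);

        sum0 += mpy_low(aw, bw);    /* one half of the pair */
        sum1 += mpy_high(aw, bw);   /* the other half       */
    }
    return sum0 + sum1;
}
```

Which element lands in the low half depends on the byte order of the machine running the sketch, just as the note above describes for the ’C6x. The final sum is the same either way, because both arrays are split the same way and the two partial sums are added at the end.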


7.4.2.2 Floating-Point Dot Product 


Example 7-16 shows the list of ’C6x instructions that execute the unrolled 
floating-point dot product loop. Symbolic variable names are used instead of 
actual registers. Using symbolic names for data and pointers makes code eas- 
ier to write and allows the optimizer to allocate registers. However, you must 
use the .reg assembly optimizer directive. See the TMS320C6x Optimizing C 
Compiler User’s Guide for more information on writing linear assembly code. 


Example 7-16. Linear Assembly for Floating-Point Dot Product Inner Loop with LDDW 


        LDDW    *a++,ai1:ai0            ; load a[i+0] & a[i+1] from memory
        LDDW    *b++,bi1:bi0            ; load b[i+0] & b[i+1] from memory
        MPYSP   ai0,bi0,pi0             ; a[i+0] * b[i+0]
        MPYSP   ai1,bi1,pi1             ; a[i+1] * b[i+1]
        ADDSP   pi0,sum0,sum0           ; sum0 += (a[i+0] * b[i+0])
        ADDSP   pi1,sum1,sum1           ; sum1 += (a[i+1] * b[i+1])
[cntr]  SUB     cntr,1,cntr             ; decrement loop counter
[cntr]  B       LOOP                    ; branch to loop


The two load doubleword (LDDW) instructions load a[i], a[i+1], b[i], and b[i+1] 
on each iteration. 


Two MPYSP instructions are now necessary to multiply the second set of array 
elements. 


The two ADDSP instructions accumulate the sums of the even and odd 
elements: sum0 and sum1. 


7.4.3 Drawing a Dependency Graph


The dependency graph in Figure 7-5 for the fixed-point dot product shows that
the LDW instructions are parents of the MPY instructions and the MPY
instructions are parents of the ADD instructions. To split the graph between
the A and B register files, place an equal number of LDWs, MPYs, and ADDs on
each side. To keep both sides even, place the remaining two instructions, B and
SUB, on opposite sides.


Figure 7-5. Dependency Graph of Fixed-Point Dot Product With LDW


Similarly, the dependency graph in Figure 7-6 for the floating-point dot
product shows that the LDDW instructions are parents of the MPYSP instructions
and the MPYSP instructions are parents of the ADDSP instructions. To split
the graph between the A and B register files, place an equal number of
LDDWs, MPYSPs, and ADDSPs on each side. To keep both sides even, place
the remaining two instructions, B and SUB, on opposite sides.


Figure 7-6. Dependency Graph of Floating-Point Dot Product With LDDW


7.4.4 Linear Assembly Resource Allocation


After splitting the dependency graph for both the fixed-point and floating-point
dot products, you can assign functional units and registers, as shown in the
dependency graphs in Figure 7-7 and Figure 7-8 and in the instructions in
Example 7-17 and Example 7-18. The .M1X and .M2X represent a path in the
dependency graph crossing from one side to the other.

Figure 7-7. Dependency Graph of Fixed-Point Dot Product With LDW (Showing
Functional Units)


Example 7-17. Linear Assembly for Fixed-Point Dot Product Inner Loop With LDW
(With Allocated Resources)


        LDW     .D1     *A4++,A2        ; load ai and ai+1 from memory
        LDW     .D2     *B4++,B2        ; load bi and bi+1 from memory
        MPY     .M1X    A2,B2,A6        ; ai * bi
        MPYH    .M2X    A2,B2,B6        ; ai+1 * bi+1
        ADD     .L1     A6,A7,A7        ; sum0 += (ai * bi)
        ADD     .L2     B6,B7,B7        ; sum1 += (ai+1 * bi+1)
        SUB     .S1     A1,1,A1         ; decrement loop counter
[A1]    B       .S2     LOOP            ; branch to loop

Figure 7-8. Dependency Graph of Floating-Point Dot Product With LDDW (Showing
Functional Units)


Example 7-18. Linear Assembly for Floating-Point Dot Product Inner Loop With LDDW
(With Allocated Resources)


        LDDW    .D1     *A4++,A3:A2     ; load ai and ai+1 from memory
        LDDW    .D2     *B4++,B3:B2     ; load bi and bi+1 from memory
        MPYSP   .M1X    A2,B2,A6        ; ai * bi
        MPYSP   .M2X    A3,B3,B6        ; ai+1 * bi+1
        ADDSP   .L1     A6,A7,A7        ; sum0 += (ai * bi)
        ADDSP   .L2     B6,B7,B7        ; sum1 += (ai+1 * bi+1)
        SUB     .S1     A1,1,A1         ; decrement loop counter
[A1]    B       .S2     LOOP            ; branch to loop


7.4.5 Final Assembly


Example 7-19 shows the final assembly code for the unrolled loop of the
fixed-point dot product and Example 7-20 shows the final assembly code for the
unrolled loop of the floating-point dot product.


7.4.5.1 Fixed-Point Dot Product 


Example 7-19 uses LDW instructions instead of LDH instructions. 


Example 7-19. Assembly Code for Fixed-Point Dot Product With LDW
(Before Software Pipelining)


        MVK     .S1     50,A1           ; set up loop counter
||      ZERO    .L1     A7              ; zero out sum0 accumulator
||      ZERO    .L2     B7              ; zero out sum1 accumulator
LOOP:
        LDW     .D1     *A4++,A2        ; load ai & ai+1 from memory
||      LDW     .D2     *B4++,B2        ; load bi & bi+1 from memory
        SUB     .S1     A1,1,A1         ; decrement loop counter
[A1]    B       .S2     LOOP            ; branch to loop
        NOP     2
        MPY     .M1X    A2,B2,A6        ; ai * bi
||      MPYH    .M2X    A2,B2,B6        ; ai+1 * bi+1
        NOP
        ADD     .L1     A6,A7,A7        ; sum0 += (ai * bi)
||      ADD     .L2     B6,B7,B7        ; sum1 += (ai+1 * bi+1)
                                        ; Branch occurs here
        ADD     .L1X    A7,B7,A4        ; sum = sum0 + sum1


The code in Example 7-19 includes the following optimizations:


- The setup code for the loop is included to initialize the array pointers and
  the loop counter and to clear the accumulators. The setup code assumes
  that A4 and B4 have been initialized to point to arrays a and b, respectively.

- The MVK instruction initializes the loop counter.

- The two ZERO instructions, which execute in parallel, initialize the even
  and odd accumulators (sum0 and sum1) to 0.

- The third ADD instruction adds the even and odd accumulators.


7.4.5.2 Floating-Point Dot Product


Example 7-20 uses LDDW instructions instead of LDW instructions.


Example 7-20. Assembly Code for Floating-Point Dot Product With LDDW
(Before Software Pipelining)


        MVK     .S1     50,A1           ; set up loop counter
||      ZERO    .L1     A7              ; zero out sum0 accumulator
||      ZERO    .L2     B7              ; zero out sum1 accumulator
LOOP:
        LDDW    .D1     *A4++,A3:A2     ; load ai & ai+1 from memory
||      LDDW    .D2     *B4++,B3:B2     ; load bi & bi+1 from memory
        SUB     .S1     A1,1,A1         ; decrement loop counter
        NOP     2
[A1]    B       .S2     LOOP            ; branch to loop
        MPYSP   .M1X    A2,B2,A6        ; ai * bi
||      MPYSP   .M2X    A3,B3,B6        ; ai+1 * bi+1
        NOP     3
        ADDSP   .L1     A6,A7,A7        ; sum0 += (ai * bi)
||      ADDSP   .L2     B6,B7,B7        ; sum1 += (ai+1 * bi+1)
                                        ; Branch occurs here
        NOP     3
        ADDSP   .L1X    A7,B7,A4        ; sum = sum0 + sum1
        NOP     3

The code in Example 7-20 includes the following optimizations:


- The setup code for the loop is included to initialize the array pointers and
  the loop counter and to clear the accumulators. The setup code assumes
  that A4 and B4 have been initialized to point to arrays a and b, respectively.

- The MVK instruction initializes the loop counter.

- The two ZERO instructions, which execute in parallel, initialize the even
  and odd accumulators (sum0 and sum1) to 0.

- The third ADDSP instruction adds the even and odd accumulators.


7.4.6 Comparing Performance


Executing the fixed-point dot product with the optimizations in Example 7-19
requires only 50 iterations, because you operate in parallel on both the even
and odd array elements. With the setup code and the final ADD instruction, 100
iterations of this loop require a total of 402 cycles (1 + 8 x 50 + 1).


Table 7-3 compares the performance of the different versions of the
fixed-point dot product code discussed so far.


Table 7-3. Comparison of Fixed-Point Dot Product Code With Use of LDW

Code Example                                                        100 Iterations     Cycle Count
Example 7-9   Fixed-point dot product nonparallel assembly          2 + 100 x 16       1602
Example 7-10  Fixed-point dot product parallel assembly             1 + 100 x 8        801
Example 7-19  Fixed-point dot product parallel assembly with LDW    1 + (50 x 8) + 1   402


Executing the floating-point dot product with the optimizations in
Example 7-20 requires only 50 iterations, because you operate in parallel on
both the even and odd array elements. With the setup code and the final
ADDSP instruction, 100 iterations of this loop require a total of 508 cycles
(1 + 10 x 50 + 7).


Table 7-4 compares the performance of the different versions of the
floating-point dot product code discussed so far.


Table 7-4. Comparison of Floating-Point Dot Product Code With Use of LDDW

Code Example                                                           100 Iterations      Cycle Count
Example 7-11  Floating-point dot product nonparallel assembly          2 + 100 x 21        2102
Example 7-12  Floating-point dot product parallel assembly             1 + 100 x 10        1001
Example 7-20  Floating-point dot product parallel assembly with LDDW   1 + (50 x 10) + 7   508
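
The cycle counts in the comparison tables all follow one pattern: setup cycles, plus the number of loop iterations times the cycles per iteration, plus any cycles spent after the loop. A minimal C sketch of that arithmetic follows; total_cycles is a hypothetical helper name chosen for illustration, not part of any TI tool.

```c
#include <assert.h>

/* Total cycles = setup + iterations * cycles per iteration + cycles after the loop. */
int total_cycles(int setup, int iterations, int per_iteration, int after_loop)
{
    assert(setup >= 0 && iterations >= 0);
    assert(per_iteration >= 0 && after_loop >= 0);
    return setup + iterations * per_iteration + after_loop;
}
```

For example, total_cycles(2, 100, 16, 0) gives 1602 for the nonparallel fixed-point code, total_cycles(1, 50, 8, 1) gives 402 for the fixed-point code with LDW, and total_cycles(1, 50, 10, 7) gives 508 for the floating-point code with LDDW. The unrolled versions run only 50 iterations because each iteration handles two array elements.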


7.5 Software Pipelining


This section describes the process for improving the performance of the as- 
sembly code in the previous section through software pipelining. 


Software pipelining is a technique used to schedule instructions from a loop
so that multiple iterations execute in parallel. The parallel resources on the
’C6x make it possible to initiate a new loop iteration before previous iterations
finish. The goal of software pipelining is to start a new loop iteration as soon
as possible.


The modulo iteration interval scheduling table is introduced in this section as 
an aid to creating software-pipelined loops. 


The fixed-point dot product code in Example 7-19 needs eight cycles for each
iteration of the loop: five cycles for the LDWs, two cycles for the MPYs, and one
cycle for the ADDs.


Figure 7-9 shows the dependency graph for the fixed-point dot product
instructions. Example 7-21 shows the same dot product assembly code in
Example 7-17, except that the SUB instruction is now conditional on the loop
counter (A1).


Note:

Making the SUB instruction conditional on A1 ensures that A1 stops
decrementing when it reaches 0. Otherwise, as the loop executes five more
times, the loop counter becomes a negative number. When A1 is negative, it is
nonzero and, therefore, causes the condition on the branch to be true again. If
the SUB instruction were not conditional on A1, you would have an infinite loop.
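
The effect described in the note can be seen with a few lines of C. This is a deliberately simplified model, not a cycle-accurate simulation: it assumes only that the loop body runs a fixed number of extra times after the counter first reaches 0, and that a [A1] condition is true for any nonzero value. The names drain_counter and branch_predicate are hypothetical.

```c
#include <assert.h>

/*
 * Value of the loop counter after the body runs `extra` more times,
 * starting from a1. If sub_is_conditional is nonzero the decrement is
 * modeled as [A1] SUB; otherwise it is modeled as a plain SUB.
 */
int drain_counter(int a1, int extra, int sub_is_conditional)
{
    int i;

    assert(extra >= 0);
    for (i = 0; i < extra; i++) {
        if (!sub_is_conditional || a1 != 0)
            a1 -= 1;
    }
    return a1;
}

/* A [A1] condition is true for any nonzero value, negative values included. */
int branch_predicate(int a1)
{
    return a1 != 0;
}
```

Starting from 0 with five extra passes, the conditional SUB leaves the counter at 0, so branch_predicate() stays false and no further branches are issued. The unconditional SUB leaves the counter at -5, so branch_predicate() is true again and the loop restarts.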


The floating-point dot product code in Example 7-20 needs ten cycles for each 
iteration of the loop: five cycles for the LDDWs, four cycles for the MPYSPs, 
and one cycle for the ADDSPs. 


Figure 7-10 shows the dependency graph for the floating-point dot product
instructions. Example 7-22 shows the same dot product assembly code in
Example 7-18, except that the SUB instruction is now conditional on the loop
counter (A1).


Note:

The ADDSP has 3 delay slots associated with it. The extra delay slots are
taken up by the LDDW, SUB, and NOP when executing the next cycle of the
loop. Thus an NOP 3 is not required inside the loop but is required outside
the loop prior to adding sum0 and sum1 together.



Figure 7-9. Dependency Graph of Fixed-Point Dot Product With LDW
(Showing Functional Units)


Example 7-21. Linear Assembly for Fixed-Point Dot Product Inner Loop
(With Conditional SUB Instruction)


        LDW     .D1     *A4++,A2        ; load ai and ai+1 from memory
        LDW     .D2     *B4++,B2        ; load bi and bi+1 from memory
        MPY     .M1X    A2,B2,A6        ; ai * bi
        MPYH    .M2X    A2,B2,B6        ; ai+1 * bi+1
        ADD     .L1     A6,A7,A7        ; sum0 += (ai * bi)
        ADD     .L2     B6,B7,B7        ; sum1 += (ai+1 * bi+1)
[A1]    SUB     .S1     A1,1,A1         ; decrement loop counter
[A1]    B       .S2     LOOP            ; branch to top of loop

Figure 7-10. Dependency Graph of Floating-Point Dot Product With LDDW
(Showing Functional Units)


Example 7-22. Linear Assembly for Floating-Point Dot Product Inner Loop
(With Conditional SUB Instruction)


        LDDW    .D1     *A4++,A3:A2     ; load ai and ai+1 from memory
        LDDW    .D2     *B4++,B3:B2     ; load bi and bi+1 from memory
        MPYSP   .M1X    A2,B2,A6        ; ai * bi
        MPYSP   .M2X    A3,B3,B6        ; ai+1 * bi+1
        ADDSP   .L1     A6,A7,A7        ; sum0 += (ai * bi)
        ADDSP   .L2     B6,B7,B7        ; sum1 += (ai+1 * bi+1)
[A1]    SUB     .S1     A1,1,A1         ; decrement loop counter
[A1]    B       .S2     LOOP            ; branch to top of loop


7.5.1 Modulo Iteration Interval Scheduling


Another way to represent the performance of the code is by looking at it in a
modulo iteration interval scheduling table. This table shows how a
software-pipelined loop executes and tracks the available resources on a
cycle-by-cycle basis to ensure that no resource is used twice on any given
cycle. The iteration interval of a loop is the number of cycles between the
initiations of successive iterations of that loop.


7.5.1.1 Fixed-Point Example 


The fixed-point code in Example 7-19 needs eight cycles for each iteration of 
the loop, so the iteration interval is eight. 


Table 7-5 shows a modulo iteration interval scheduling table for the fixed-point
dot product loop before software pipelining (Example 7-19). Each row
represents a functional unit. There is a column for each cycle in the loop
showing the instruction that is executing on a particular cycle:


- LDWs on the .D units are issued on cycles 0, 8, 16, 24, etc.
- MPY and MPYH on the .M units are issued on cycles 5, 13, 21, 29, etc.
- ADDs on the .L units are issued on cycles 7, 15, 23, 31, etc.
- SUB on the .S1 unit is issued on cycles 1, 9, 17, 25, etc.
- B on the .S2 unit is issued on cycles 2, 10, 18, 26, etc.
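
The issue cycles in the list above follow from one rule: an instruction that sits at a given offset within one iteration is issued again every iteration interval. The following C sketch states that rule; issue_cycle and table_column are hypothetical helper names used only for illustration.

```c
#include <assert.h>

/* Cycle on which an instruction issues in iteration k (k = 0, 1, 2, ...),
 * given its cycle offset within one iteration and the iteration interval. */
int issue_cycle(int offset, int iteration_interval, int k)
{
    assert(offset >= 0 && iteration_interval > 0 && k >= 0);
    return offset + k * iteration_interval;
}

/* Column of the modulo scheduling table in which a given cycle falls. */
int table_column(int cycle, int iteration_interval)
{
    assert(cycle >= 0 && iteration_interval > 0);
    return cycle % iteration_interval;
}
```

With an iteration interval of eight, the B instruction at offset 2 issues on cycles 2, 10, 18, and 26, and every one of those cycles falls in column 2 of the table.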


Table 7-5. Modulo Iteration Interval Scheduling Table for Fixed-Point Dot Product
(Before Software Pipelining)

Unit / Cycle  0, 8, ...  1, 9, ...  2, 10, ...  3, 11, ...  4, 12, ...  5, 13, ...  6, 14, ...  7, 15, ...
.D1           LDW
.D2           LDW
.M1                                                                     MPY
.M2                                                                     MPYH
.L1                                                                                             ADD
.L2                                                                                             ADD
.S1                      SUB
.S2                                 B

In this example, each unit is used only once every eight cycles. 


7.5.1.2 Floating-Point Example


The floating-point code in Example 7-20 needs ten cycles for each iteration
of the loop, so the iteration interval is ten.


Table 7-6 shows a modulo iteration interval scheduling table for the
floating-point dot product loop before software pipelining (Example 7-20).
Each row represents a functional unit. There is a column for each cycle in the
loop showing the instruction that is executing on a particular cycle:


- LDDWs on the .D units are issued on cycles 0, 10, 20, 30, etc.
- MPYSPs on the .M units are issued on cycles 5, 15, 25, 35, etc.
- ADDSPs on the .L units are issued on cycles 9, 19, 29, 39, etc.
- SUB on the .S1 unit is issued on cycles 3, 13, 23, 33, etc.
- B on the .S2 unit is issued on cycles 4, 14, 24, 34, etc.


Table 7-6. Modulo Iteration Interval Scheduling Table for Floating-Point Dot Product
(Before Software Pipelining)

.M1           MPYSP
.M2           MPYSP
.L1           ADDSP
.L2           ADDSP

In this example, each unit is used only once every ten cycles. 


7.5.1.3 Determining the Minimum Iteration Interval


Software pipelining increases performance by using the resources more effi- 
ciently. However, to create a fully pipelined schedule, it is helpful to first deter- 
mine the minimum iteration interval. 


The minimum iteration interval of a loop is the minimum number of cycles you 
must wait between each initiation of successive iterations of that loop. The 
smaller the iteration interval, the fewer cycles it takes to execute a loop. 


Resources and data dependency constraints determine the minimum iteration 
interval. The most-used resource constrains the minimum iteration interval. 
For example, if four instructions in a loop all use the .S1 unit, the minimum it- 
eration interval is at least 4. Four instructions using the same resource cannot 
execute in parallel and, therefore, require at least four separate cycles to 
execute each instruction. 


With the SUB and branch instructions on opposite sides of the dependency
graph in Figure 7-9 and Figure 7-10, all eight instructions use a different
functional unit and no two instructions use the same cross paths (1X and 2X).
Because no two instructions use the same resource, the minimum iteration
interval based on resources is 1.
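
The resource bound described above amounts to counting how many instructions in the loop use each functional unit and taking the largest count. The C sketch below illustrates that count under simplifying assumptions: the enum unit type, resource_min_ii, and the two sample arrays are hypothetical names, and the sketch counts only the eight functional units. Cross paths, which are also resources, are not modeled.

```c
#include <assert.h>
#include <stddef.h>

/* The eight functional units; N_UNITS is the number of units. */
enum unit { L1, L2, S1, S2, M1, M2, D1, D2, N_UNITS };

/* Minimum iteration interval imposed by functional-unit usage alone:
 * the largest number of loop instructions assigned to any one unit. */
int resource_min_ii(const enum unit *units, size_t n)
{
    int use[N_UNITS] = {0};
    int max = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        assert((int)units[i] >= 0 && (int)units[i] < N_UNITS);
        use[units[i]]++;
        if (use[units[i]] > max)
            max = use[units[i]];
    }
    return max;
}

/* Unit assignments of the eight instructions in the dot product loop. */
static const enum unit dot_product_loop[] = { D1, D2, M1, M2, L1, L2, S1, S2 };

/* The example from the text: four instructions that all use .S1. */
static const enum unit four_on_s1[] = { S1, S1, S1, S1 };
```

For the dot product loop the bound is 1, because every instruction has its own unit. For four instructions on .S1 the bound is 4.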


Note:

In this particular example, there are no data dependencies to affect the
minimum iteration interval. However, future examples may demonstrate this
constraint.


7.5.1.4 Creating a Fully Pipelined Schedule


Having determined that the minimum iteration interval is 1, you can initiate a
new iteration every cycle. You can schedule LDW (or LDDW) and MPY (or
MPYSP) instructions on every cycle.


Fixed-Point Example 


Table 7-7 shows a fully pipelined schedule for the fixed-point dot product
example.

Table 7-7. Modulo Iteration Interval Table for Fixed-Point Dot Product
(After Software Pipelining)

            Loop Prolog
Unit/Cycle  0           1           2           3           4           5           6           7, 8, 9...
.D1         LDW         LDW*        LDW**       LDW***      LDW****     LDW*****    LDW******   LDW*******
.D2         LDW         LDW*        LDW**       LDW***      LDW****     LDW*****    LDW******   LDW*******
.M1                                                                     MPY         MPY*        MPY**
.M2                                                                     MPYH        MPYH*       MPYH**
.L1                                                                                             ADD
.L2                                                                                             ADD
.S1                     SUB         SUB*        SUB**       SUB***      SUB****     SUB*****    SUB******
.S2                                 B           B*          B**         B***        B****       B*****


Note: The asterisks indicate the iteration of the loop; shading indicates the single-cycle loop. 


The rightmost column in Table 7-7 is a single-cycle loop that contains the 
entire loop. Cycles 0-6 are loop setup code, or loop prolog. 


Asterisks define which iteration of the loop the instruction is executing each 
cycle. For example, the rightmost column shows that on any given cycle inside 
the loop: 


- The ADD instructions are adding data for iteration n.
- The MPY instructions are multiplying data for iteration n + 2 (**).
- The LDW instructions are loading data for iteration n + 7 (*******).
- The SUB instruction is executing for iteration n + 6 (******).
- The B instruction is executing for iteration n + 5 (*****).


In this case, multiple iterations of the loop execute in parallel in a
software pipeline that is eight iterations deep, with iterations n through
n + 7 executing in parallel. Fixed-point software pipelines are rarely deeper
than the one created by this single-cycle loop. As loop sizes grow, the number
of iterations that can execute in parallel tends to become fewer.
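
The iteration offsets and the pipeline depth follow directly from the cycle on
which each instruction of one iteration is first issued. The following C
fragment is a minimal sketch of that arithmetic for a single-cycle loop; the
function names are hypothetical and chosen only for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* In a single-cycle loop a new iteration starts every cycle, so an
   instruction first issued on cycle `first` runs (last - first) iterations
   ahead of the instruction first issued on cycle `last`. */
int iterations_ahead(int first, int last)
{
    assert(first <= last);
    return last - first;
}

/* Number of iterations in flight on any one pass through the loop. */
int pipeline_depth(const int *first_issue, size_t count)
{
    assert(first_issue != NULL && count > 0);
    int lo = first_issue[0];
    int hi = first_issue[0];
    for (size_t i = 1; i < count; i++) {
        if (first_issue[i] < lo) lo = first_issue[i];
        if (first_issue[i] > hi) hi = first_issue[i];
    }
    return iterations_ahead(lo, hi) + 1;
}
```

With the fixed-point first-issue cycles (LDW on 0, SUB on 1, B on 2, MPY on 5,
ADD on 7), the LDW runs 7 iterations ahead of the ADD, the MPY runs 2 ahead,
and the depth is 8, which agrees with the list above.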


Floating-Point Example 


Table 7-8 shows a fully pipelined schedule for the floating-point dot product
example.


Table 7-8. Modulo Iteration Interval Table for Floating-Point Dot Product
(After Software Pipelining)

            Loop Prolog
Unit/Cycle  0             1             2             3             4             5             6             7             8             9, 10, 11...
.D1         LDDW          LDDW*         LDDW**        LDDW***       LDDW****      LDDW*****     LDDW******    LDDW*******   LDDW********  LDDW*********
.D2         LDDW          LDDW*         LDDW**        LDDW***       LDDW****      LDDW*****     LDDW******    LDDW*******   LDDW********  LDDW*********
.M1                                                                               MPYSP         MPYSP*        MPYSP**       MPYSP***      MPYSP****
.M2                                                                               MPYSP         MPYSP*        MPYSP**       MPYSP***      MPYSP****
.L1                                                                                                                                       ADDSP
.L2                                                                                                                                       ADDSP
.S1                                                   SUB           SUB*          SUB**         SUB***        SUB****       SUB*****      SUB******
.S2                                                                 B             B*            B**           B***          B****         B*****
Note: The asterisks indicate the iteration of the loop; shading indicates the single-cycle loop. 


The rightmost column in Table 7-8 is a single-cycle loop that contains the 
entire loop. Cycles 0-8 are loop setup code, or loop prolog. 


Asterisks define which iteration of the loop the instruction is executing each
cycle. For example, the rightmost column shows that on any given cycle inside
the loop:

- The ADDSP instructions are adding data for iteration n.
- The MPYSP instructions are multiplying data for iteration n + 4 (****).
- The LDDW instructions are loading data for iteration n + 9 (*********).
- The SUB instruction is executing for iteration n + 6 (******).
- The B instruction is executing for iteration n + 5 (*****).


Note:

Since the ADDSP instruction has three delay slots associated with it, the
results of adding are staggered by four. That is, the first result from the
ADDSP is added to the fifth result, which is then added to the ninth, and so
on. The second result is added to the sixth, which is then added to the 10th.
This is shown in Table 7-9.


In this case, multiple iterations of the loop execute in parallel in a
software pipeline that is ten iterations deep, with iterations n through
n + 9 executing in parallel. Floating-point software pipelines are rarely
deeper than the one created by this single-cycle loop. As loop sizes grow, the
number of iterations that can execute in parallel tends to become fewer.


7.5.1.5 Staggered Accumulation With a Multicycle Instruction 


When accumulating results with an instruction that is multicycle (that is, has
delay slots other than 0), you must either unroll the loop or stagger the
results. When unrolling the loop, multiple accumulators collect the results so
that one result has finished executing and has been written into the
accumulator before the next result is added to that accumulator. If you do not
unroll the loop, then the accumulator will contain staggered results.


Staggered results occur when you attempt to accumulate successive results
while still in the delay slots of a previous add. This can be achieved without
error if you are aware of what is in the accumulator, what will be added to
that accumulator, and when the results will be written on a given cycle (such
as the pseudo-code shown in Example 7-23).


Example 7-23. Pseudo-Code for Single-Cycle Accumulator With ADDSP


LOOP:      ADDSP   x,sum,sum
||         LDW     *xptr++,x
|| [cond]  B       LOOP
|| [cond]  SUB     cond,1,cond


Table 7-9 shows the results of the loop kernel for a single-cycle accumulator 
using a multicycle add instruction; in this case, the ADDSP, which has three 
delay slots (a 4-cycle instruction). 


Table 7-9. Software Pipeline Accumulation Staggered Results Due to Three-Cycle
Delay

                                  Current value of
Cycle #  Pseudoinstruction        pseudoregister sum   Written expected result
0        ADDSP x(0),sum,sum       0                    ; cycle 4  sum = x(0)
1        ADDSP x(1),sum,sum       0                    ; cycle 5  sum = x(1)
2        ADDSP x(2),sum,sum       0                    ; cycle 6  sum = x(2)
3        ADDSP x(3),sum,sum       0                    ; cycle 7  sum = x(3)
4        ADDSP x(4),sum,sum       x(0)                 ; cycle 8  sum = x(0) + x(4)
5        ADDSP x(5),sum,sum       x(1)                 ; cycle 9  sum = x(1) + x(5)
6        ADDSP x(6),sum,sum       x(2)                 ; cycle 10 sum = x(2) + x(6)
7        ADDSP x(7),sum,sum       x(3)                 ; cycle 11 sum = x(3) + x(7)
8        ADDSP x(8),sum,sum       x(0) + x(4)          ; cycle 12 sum = x(0) + x(4) + x(8)
...
i+j †    ADDSP x(i+j),sum,sum     x(j) + x(j+4) + x(j+8) ... x(i-4+j)
                                                       ; cycle i+j+4 sum = x(j) + x(j+4) +
                                                       ;   x(j+8) ... x(i-4+j) + x(i+j)

† where i is a multiple of 4


The first value of the array x, x(0), is added to the accumulator (sum) on
cycle 0, but the result is not ready until cycle 4. This means that on cycle 1
when x(1) is added to the accumulator (sum), sum has no value in it from x(0).
Thus, when this result is ready on cycle 5, sum will have the value x(1) in
it, instead of the value x(0) + x(1). When you reach cycle 4, sum will have
the value x(0) in it and the value x(4) will be added to that, causing
sum = x(0) + x(4) on cycle 8. This is continuously repeated, resulting in four
separate accumulations (using the register "sum").


The current value in the accumulator "sum" depends on which iteration is being
done. After the completion of the loop, the last four sums should be written
into separate registers and then added together to give the final result. This
is shown in Example 7-27 on page 7-44.
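
The staggering can be reproduced with a small C model. This is a sketch under
stated assumptions, not device code: one add is issued per cycle, each result
becomes visible in sum four cycles after issue, and the names
staggered_accumulate and combine_partials are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define ADDSP_LATENCY 4   /* three delay slots: result lands 4 cycles after issue */

/* Issue one "ADDSP x(c),sum,sum" per cycle. On return, partial[j] holds the
   last result produced for chain j, that is, x(j) + x(j+4) + x(j+8) + ... */
void staggered_accumulate(const float *x, size_t n,
                          float partial[ADDSP_LATENCY])
{
    float sum = 0.0f;                 /* pseudoregister sum, zeroed before the loop */

    for (size_t j = 0; j < ADDSP_LATENCY; j++)
        partial[j] = 0.0f;            /* no results in flight yet */

    for (size_t c = 0; c < n; c++) {
        size_t slot = c % ADDSP_LATENCY;
        if (c >= ADDSP_LATENCY)
            sum = partial[slot];      /* result of the add issued on cycle c - 4 */
        partial[slot] = x[c] + sum;   /* becomes visible in sum on cycle c + 4 */
    }
}

/* Combine the four partial sums pairwise after the loop. */
float combine_partials(const float partial[ADDSP_LATENCY])
{
    float sum01 = partial[0] + partial[1];
    float sum23 = partial[2] + partial[3];
    return sum01 + sum23;
}
```

For x = 1, 2, ..., 9 the four partial sums are 15, 8, 10, and 12, and the
combined result is 45, the same value a sequential accumulation would give for
these inputs.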
7.5.2 Using the Assembly Optimizer to Create Optimized Loops


Example 7-24 shows the linear assembly code for the full fixed-point dot
product loop. Example 7-25 shows the linear assembly code for the full
floating-point dot product loop. You can use this code as input to the
assembly optimizer tool to create software-pipelined loops automatically. See
the TMS320C6000 Optimizing C Compiler User's Guide for more information on the
assembly optimizer.


Example 7-24. Linear Assembly for Full Fixed-Point Dot Product


        .global _dotp
_dotp:  .cproc  a, b
        .reg    sum, sum0, sum1, cntr
        .reg    ai_i1, bi_i1, pi, pi1
        MVK     50,cntr           ; cntr = 100/2
        ZERO    sum0              ; multiply result = 0
        ZERO    sum1              ; multiply result = 0
LOOP:   .trip   50
        LDW     *a++,ai_i1        ; load ai & ai+1 from memory
        LDW     *b++,bi_i1        ; load bi & bi+1 from memory
        MPY     ai_i1,bi_i1,pi    ; ai * bi
        MPYH    ai_i1,bi_i1,pi1   ; ai+1 * bi+1
        ADD     pi,sum0,sum0      ; sum0 += (ai * bi)
        ADD     pi1,sum1,sum1     ; sum1 += (ai+1 * bi+1)
[cntr]  SUB     cntr,1,cntr       ; decrement loop counter
[cntr]  B       LOOP              ; branch to loop
        ADD     sum0,sum1,sum     ; compute final result
        .return sum
        .endproc


Resources such as functional units and 1X and 2X cross paths do not have 
to be specified because these can be allocated automatically by the assembly 
optimizer. 
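
As a point of reference, the computation that the fixed-point linear assembly
describes can be written in C. The sketch below is an illustration only: the
function name dotp_pairs and the pairs parameter are assumptions (the example
itself fixes the trip count at 50), and each pass of the loop stands for one
32-bit load that carries two 16-bit elements.

```c
#include <assert.h>
#include <stddef.h>

/* C model of the linear assembly loop. Each trip handles one 32-bit word,
   that is, one pair of 16-bit elements, and keeps two running sums. */
int dotp_pairs(const short *a, const short *b, size_t pairs)
{
    int sum0 = 0;                               /* ZERO sum0 */
    int sum1 = 0;                               /* ZERO sum1 */

    assert(a != NULL && b != NULL);
    for (size_t i = 0; i < pairs; i++) {        /* .trip 50 in the example */
        sum0 += a[2 * i] * b[2 * i];            /* MPY  and ADD: ai * bi     */
        sum1 += a[2 * i + 1] * b[2 * i + 1];    /* MPYH and ADD: ai+1 * bi+1 */
    }
    return sum0 + sum1;                         /* ADD sum0,sum1,sum */
}
```

Calling dotp_pairs(a, b, 50) on two 100-element arrays corresponds to the trip
count used in the example.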
Example 7-25. Linear Assembly for Full Floating-Point Dot Product


        .global _dotp
_dotp:  .cproc  a, b
        .reg    sum, sum0, sum1, cntr
        .reg    ai:ai1, bi:bi1, pi, pi1
        MVK     50,cntr           ; cntr = 100/2
        ZERO    sum0              ; multiply result = 0
        ZERO    sum1              ; multiply result = 0
LOOP:   .trip   50
        LDDW    *a++,ai:ai1       ; load ai & ai+1 from memory
        LDDW    *b++,bi:bi1       ; load bi & bi+1 from memory
        MPYSP   ai,bi,pi          ; ai * bi
        MPYSP   ai1,bi1,pi1       ; ai+1 * bi+1
        ADDSP   pi,sum0,sum0      ; sum0 += (ai * bi)
        ADDSP   pi1,sum1,sum1     ; sum1 += (ai+1 * bi+1)
[cntr]  SUB     cntr,1,cntr       ; decrement loop counter
[cntr]  B       LOOP              ; branch to loop
        ADDSP   sum0,sum1,sum     ; compute final result
        .return sum
        .endproc


7.5.3 Final Assembly 


Example 7-26 shows the assembly code for the fixed-point software-pipelined
dot product in Table 7-7 on page 7-36. Example 7-27 shows the assembly code
for the floating-point software-pipelined dot product in Table 7-8 on
page 7-37. The accumulators are initialized to 0 and the loop counter is set
up in the first execute packet in parallel with the first load instructions.
The asterisks in the comments correspond with those in Table 7-7 and
Table 7-8, respectively.


Note:

All instructions executing in parallel constitute an execute packet. An
execute packet can contain up to eight instructions.

See the TMS320C62x/C67x CPU and Instruction Set Reference Guide for more
information about pipeline operation.
7.5.3.1 Fixed-Point Example


Multiple branch instructions are in the pipe. The first branch in the fixed-point 
dot product is issued on cycle 2 but does not actually branch until the end of 
cycle 7 (after five delay slots). The branch target is the execute packet defined 
by the label LOOP. On cycle 7, the first branch returns to the same execute 
packet, resulting in a single-cycle loop. On every cycle after cycle 7, a branch 
executes back to LOOP until the loop counter finally decrements to 0. Once 
the loop counter is 0, five more branches execute because they are already 
in the pipe. 


Executing the dot product code with the software pipelining as shown in 
Example 7-26 requires a total of 58 cycles (7 + 50 + 1), which is a significant 
improvement over the 402 cycles required by the code in Example 7-19. 


Note:

The code created by the assembly optimizer will not completely match the
final assembly code shown in this and future sections because different
versions of the tool will produce slightly different code. However, the inner
loop performance (number of cycles per iteration) should be similar.
Example 7-26. Assembly Code for Fixed-Point Dot Product (Software Pipelined)


        LDW     .D1     *A4++,A2        ; load ai & ai+1 from memory
||      LDW     .D2     *B4++,B2        ; load bi & bi+1 from memory
||      MVK     .S1     50,A1           ; set up loop counter
||      ZERO    .L1     A7              ; zero out sum0 accumulator
||      ZERO    .L2     B7              ; zero out sum1 accumulator

   [A1] SUB     .S1     A1,1,A1         ; decrement loop counter
||      LDW     .D1     *A4++,A2        ;* load ai & ai+1 from memory
||      LDW     .D2     *B4++,B2        ;* load bi & bi+1 from memory

   [A1] SUB     .S1     A1,1,A1         ;* decrement loop counter
|| [A1] B       .S2     LOOP            ; branch to loop
||      LDW     .D1     *A4++,A2        ;** load ai & ai+1 from memory
||      LDW     .D2     *B4++,B2        ;** load bi & bi+1 from memory

   [A1] SUB     .S1     A1,1,A1         ;** decrement loop counter
|| [A1] B       .S2     LOOP            ;* branch to loop
||      LDW     .D1     *A4++,A2        ;*** load ai & ai+1 from memory
||      LDW     .D2     *B4++,B2        ;*** load bi & bi+1 from memory

   [A1] SUB     .S1     A1,1,A1         ;*** decrement loop counter
|| [A1] B       .S2     LOOP            ;** branch to loop
||      LDW     .D1     *A4++,A2        ;**** load ai & ai+1 from memory
||      LDW     .D2     *B4++,B2        ;**** load bi & bi+1 from memory

        MPY     .M1X    A2,B2,A6        ; ai * bi
||      MPYH    .M2X    A2,B2,B6        ; ai+1 * bi+1
|| [A1] SUB     .S1     A1,1,A1         ;**** decrement loop counter
|| [A1] B       .S2     LOOP            ;*** branch to loop
||      LDW     .D1     *A4++,A2        ;***** ld ai & ai+1 from memory
||      LDW     .D2     *B4++,B2        ;***** ld bi & bi+1 from memory

        MPY     .M1X    A2,B2,A6        ;* ai * bi
||      MPYH    .M2X    A2,B2,B6        ;* ai+1 * bi+1
|| [A1] SUB     .S1     A1,1,A1         ;***** decrement loop counter
|| [A1] B       .S2     LOOP            ;**** branch to loop
||      LDW     .D1     *A4++,A2        ;****** ld ai & ai+1 from memory
||      LDW     .D2     *B4++,B2        ;****** ld bi & bi+1 from memory

LOOP:
        ADD     .L1     A6,A7,A7        ; sum0 += (ai * bi)
||      ADD     .L2     B6,B7,B7        ; sum1 += (ai+1 * bi+1)
||      MPY     .M1X    A2,B2,A6        ;** ai * bi
||      MPYH    .M2X    A2,B2,B6        ;** ai+1 * bi+1
|| [A1] SUB     .S1     A1,1,A1         ;****** decrement loop counter
|| [A1] B       .S2     LOOP            ;***** branch to loop
||      LDW     .D1     *A4++,A2        ;******* ld ai & ai+1 fm memory
||      LDW     .D2     *B4++,B2        ;******* ld bi & bi+1 fm memory
        ; Branch occurs here
        ADD     .L1X    A7,B7,A4        ; sum = sum0 + sum1
7.5.3.2 Floating-Point Example


The first branch in the floating-point dot product is issued on cycle 4 but
does not actually branch until the end of cycle 9 (after five delay slots).
The branch target is the execute packet defined by the label LOOP. On cycle 9,
the first branch returns to the same execute packet, resulting in a
single-cycle loop. On every cycle after cycle 9, a branch executes back to
LOOP until the loop counter finally decrements to 0. Once the loop counter is
0, five more branches execute because they are already in the pipe.


Executing the floating-point dot product code with the software pipelining as
shown in Example 7-27 requires a total of 74 cycles (9 + 50 + 15), which is a
significant improvement over the 508 cycles required by the code in
Example 7-20.


Example 7-27. Assembly Code for Floating-Point Dot Product (Software Pipelined)


        MVK     .S1     50,A1           ; set up loop counter
||      ZERO    .L1     A8              ; sum0 = 0
||      ZERO    .L2     B8              ; sum1 = 0
||      LDDW    .D1     *A4++,A7:A6     ; load ai & ai + 1 from memory
||      LDDW    .D2     *B4++,B7:B6     ; load bi & bi + 1 from memory

        LDDW    .D1     *A4++,A7:A6     ;* load ai & ai + 1 from memory
||      LDDW    .D2     *B4++,B7:B6     ;* load bi & bi + 1 from memory

        LDDW    .D1     *A4++,A7:A6     ;** load ai & ai + 1 from memory
||      LDDW    .D2     *B4++,B7:B6     ;** load bi & bi + 1 from memory

        LDDW    .D1     *A4++,A7:A6     ;*** load ai & ai + 1 from memory
||      LDDW    .D2     *B4++,B7:B6     ;*** load bi & bi + 1 from memory
|| [A1] SUB     .S1     A1,1,A1         ; decrement loop counter

        LDDW    .D1     *A4++,A7:A6     ;**** load ai & ai + 1 from memory
||      LDDW    .D2     *B4++,B7:B6     ;**** load bi & bi + 1 from memory
|| [A1] B       .S2     LOOP            ; branch to loop
|| [A1] SUB     .S1     A1,1,A1         ;* decrement loop counter

        LDDW    .D1     *A4++,A7:A6     ;***** load ai & ai + 1 from memory
||      LDDW    .D2     *B4++,B7:B6     ;***** load bi & bi + 1 from memory
||      MPYSP   .M1X    A6,B6,A5        ; pi = a0 * b0
||      MPYSP   .M2X    A7,B7,B5        ; pi1 = a1 * b1
|| [A1] B       .S2     LOOP            ;* branch to loop
|| [A1] SUB     .S1     A1,1,A1         ;** decrement loop counter

        LDDW    .D1     *A4++,A7:A6     ;****** load ai & ai + 1 from memory
||      LDDW    .D2     *B4++,B7:B6     ;****** load bi & bi + 1 from memory
||      MPYSP   .M1X    A6,B6,A5        ;* pi = a0 * b0
||      MPYSP   .M2X    A7,B7,B5        ;* pi1 = a1 * b1
|| [A1] B       .S2     LOOP            ;** branch to loop
|| [A1] SUB     .S1     A1,1,A1         ;*** decrement loop counter

        LDDW    .D1     *A4++,A7:A6     ;******* load ai & ai + 1 from memory
||      LDDW    .D2     *B4++,B7:B6     ;******* load bi & bi + 1 from memory
||      MPYSP   .M1X    A6,B6,A5        ;** pi = a0 * b0
||      MPYSP   .M2X    A7,B7,B5        ;** pi1 = a1 * b1
|| [A1] B       .S2     LOOP            ;*** branch to loop
|| [A1] SUB     .S1     A1,1,A1         ;**** decrement loop counter

        LDDW    .D1     *A4++,A7:A6     ;******** load ai & ai + 1 from memory
||      LDDW    .D2     *B4++,B7:B6     ;******** load bi & bi + 1 from memory
||      MPYSP   .M1X    A6,B6,A5        ;*** pi = a0 * b0
||      MPYSP   .M2X    A7,B7,B5        ;*** pi1 = a1 * b1
|| [A1] B       .S2     LOOP            ;**** branch to loop
|| [A1] SUB     .S1     A1,1,A1         ;***** decrement loop counter

LOOP:
        LDDW    .D1     *A4++,A7:A6     ;********* load ai & ai + 1 from memory
||      LDDW    .D2     *B4++,B7:B6     ;********* load bi & bi + 1 from memory
||      MPYSP   .M1X    A6,B6,A5        ;**** pi = a0 * b0
||      MPYSP   .M2X    A7,B7,B5        ;**** pi1 = a1 * b1
||      ADDSP   .L1     A5,A8,A8        ; sum0 += (ai * bi)
||      ADDSP   .L2     B5,B8,B8        ; sum1 += (ai+1 * bi+1)
|| [A1] B       .S2     LOOP            ;***** branch to loop
|| [A1] SUB     .S1     A1,1,A1         ;****** decrement loop counter
        ; Branch occurs here
        ADDSP   .L1X    A8,B8,A0        ; sum(0) = sum0(0) + sum1(0)
        ADDSP   .L2X    A8,B8,B0        ; sum(1) = sum0(1) + sum1(1)
        ADDSP   .L1X    A8,B8,A0        ; sum(2) = sum0(2) + sum1(2)
        ADDSP   .L2X    A8,B8,B0        ; sum(3) = sum0(3) + sum1(3)
        NOP                             ; wait for B0
        ADDSP   .L1X    A0,B0,A5        ; sum(01) = sum(0) + sum(1)
        NOP                             ; wait for next B0
        ADDSP   .L2X    A0,B0,B5        ; sum(23) = sum(2) + sum(3)
        NOP     3                       ;
        ADDSP   .L1X    A5,B5,A4        ; sum = sum(01) + sum(23)
        NOP     3                       ;
7.5.3.3 Removing Extraneous Instructions


The code in Example 7-26 and Example 7-27 executes extra iterations of some of
the instructions in the loop. The following operations occur in parallel on
the last cycle of the loop in Example 7-26:

- Iteration 50 of the ADD instructions
- Iteration 52 of the MPY and MPYH instructions
- Iteration 57 of the LDW instructions

The following operations occur in parallel on the last cycle of the loop in
Example 7-27:

- Iteration 50 of the ADDSP instructions
- Iteration 54 of the MPYSP instructions
- Iteration 59 of the LDDW instructions


In most cases, extra iterations are not a problem; however, when extraneous
LDWs and LDDWs access unmapped memory, you can get unpredictable results. If
the extraneous instructions present a potential problem, remove the extraneous
load and multiply instructions by adding an epilog like that included in the
second part of Example 7-28 on page 7-48 and Example 7-29 on page 7-49.


Fixed-Point Example 


To eliminate LDWs in the fixed-point dot product from iterations 51 through
57, run the loop seven fewer times. This brings the loop counter to 43
(50 - 7), which means you still must execute seven more cycles of ADD
instructions and five more cycles of MPY instructions. Five pairs of MPYs and
seven pairs of ADDs are now outside the loop. The LDWs, MPYs, and ADDs all
execute exactly 50 times. (The shaded areas of Example 7-28 indicate the
changes in this code.)


Executing the dot product code in Example 7-28 with no extraneous LDWs still
requires a total of 58 cycles (7 + 43 + 7 + 1), but the code size is now
larger.


Floating-Point Example 


To eliminate LDDWs in the floating-point dot product from iterations 51
through 59, run the loop nine fewer times. This brings the loop counter to 41
(50 - 9), which means you still must execute nine more cycles of ADDSP
instructions and five more cycles of MPYSP instructions. Five pairs of MPYSPs
and nine pairs of ADDSPs are now outside the loop. The LDDWs, MPYSPs, and
ADDSPs all execute exactly 50 times. (The shaded areas of Example 7-29
indicate the changes in this code.)


Executing the dot product code in Example 7-29 with no extraneous LDDWs still
requires a total of 74 cycles (9 + 41 + 9 + 15), but the code size is now
larger.
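
The cycle totals and the iteration numbers quoted in this section can be
checked with a few lines of C. This is a bookkeeping sketch only; the struct
and function names are hypothetical, and the four fields simply mirror the
terms of the sums given in the text.

```c
#include <assert.h>

/* Cycle budget of a software-pipelined single-cycle loop. */
struct pipelined_loop {
    int prolog;   /* cycles of loop setup before the loop */
    int kernel;   /* number of times the single-cycle loop executes */
    int epilog;   /* cycles that drain the pipeline after the loop */
    int tail;     /* cycles needed to form the final sum */
};

int total_cycles(const struct pipelined_loop *p)
{
    assert(p != 0);
    return p->prolog + p->kernel + p->epilog + p->tail;
}

/* Highest iteration number touched inside the loop by an instruction that
   runs `ahead` iterations in front of the accumulate instruction. */
int last_iteration_in_loop(int kernel, int ahead)
{
    return kernel + ahead;
}
```

With the fixed-point values, {7, 50, 0, 1} and {7, 43, 7, 1} both total 58
cycles, and last_iteration_in_loop(50, 7) gives 57 for the LDWs, while
last_iteration_in_loop(43, 7) gives exactly 50 once the epilog is added.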


Example 7-28. Assembly Code for Fixed-Point Dot Product (Software Pipelined
With No Extraneous Loads)


        LDW     .D1     *A4++,A2        ; load ai & ai+1 from memory
||      LDW     .D2     *B4++,B2        ; load bi & bi+1 from memory
||      MVK     .S1     43,A1           ; set up loop counter
||      ZERO    .L1     A7              ; zero out sum0 accumulator
||      ZERO    .L2     B7              ; zero out sum1 accumulator

   [A1] SUB     .S1     A1,1,A1         ; decrement loop counter
||      LDW     .D1     *A4++,A2        ;* load ai & ai+1 from memory
||      LDW     .D2     *B4++,B2        ;* load bi & bi+1 from memory

   [A1] SUB     .S1     A1,1,A1         ;* decrement loop counter
|| [A1] B       .S2     LOOP            ; branch to loop
||      LDW     .D1     *A4++,A2        ;** load ai & ai+1 from memory
||      LDW     .D2     *B4++,B2        ;** load bi & bi+1 from memory

   [A1] SUB     .S1     A1,1,A1         ;** decrement loop counter
|| [A1] B       .S2     LOOP            ;* branch to loop
||      LDW     .D1     *A4++,A2        ;*** load ai & ai+1 from memory
||      LDW     .D2     *B4++,B2        ;*** load bi & bi+1 from memory

   [A1] SUB     .S1     A1,1,A1         ;*** decrement loop counter
|| [A1] B       .S2     LOOP            ;** branch to loop
||      LDW     .D1     *A4++,A2        ;**** load ai & ai+1 from memory
||      LDW     .D2     *B4++,B2        ;**** load bi & bi+1 from memory

        MPY     .M1X    A2,B2,A6        ; ai * bi
||      MPYH    .M2X    A2,B2,B6        ; ai+1 * bi+1
|| [A1] SUB     .S1     A1,1,A1         ;**** decrement loop counter
|| [A1] B       .S2     LOOP            ;*** branch to loop
||      LDW     .D1     *A4++,A2        ;***** ld ai & ai+1 from memory
||      LDW     .D2     *B4++,B2        ;***** ld bi & bi+1 from memory

        MPY     .M1X    A2,B2,A6        ;* ai * bi
||      MPYH    .M2X    A2,B2,B6        ;* ai+1 * bi+1
|| [A1] SUB     .S1     A1,1,A1         ;***** decrement loop counter
|| [A1] B       .S2     LOOP            ;**** branch to loop
||      LDW     .D1     *A4++,A2        ;****** ld ai & ai+1 from memory
||      LDW     .D2     *B4++,B2        ;****** ld bi & bi+1 from memory

LOOP:
        ADD     .L1     A6,A7,A7        ; sum0 += (ai * bi)
||      ADD     .L2     B6,B7,B7        ; sum1 += (ai+1 * bi+1)
||      MPY     .M1X    A2,B2,A6        ;** ai * bi
||      MPYH    .M2X    A2,B2,B6        ;** ai+1 * bi+1
|| [A1] SUB     .S1     A1,1,A1         ;****** decrement loop counter
|| [A1] B       .S2     LOOP            ;***** branch to loop
||      LDW     .D1     *A4++,A2        ;******* ld ai & ai+1 fm memory
||      LDW     .D2     *B4++,B2        ;******* ld bi & bi+1 fm memory
        ; Branch occurs here
        ; Epilog: (n) numbers the ADDs (1-7) and MPYs (1-5) outside the loop

        ADD     .L1     A6,A7,A7        ; sum0 += (ai * bi)       (1)
||      ADD     .L2     B6,B7,B7        ; sum1 += (ai+1 * bi+1)
||      MPY     .M1X    A2,B2,A6        ;** ai * bi               (1)
||      MPYH    .M2X    A2,B2,B6        ;** ai+1 * bi+1

        ADD     .L1     A6,A7,A7        ; sum0 += (ai * bi)       (2)
||      ADD     .L2     B6,B7,B7        ; sum1 += (ai+1 * bi+1)
||      MPY     .M1X    A2,B2,A6        ;** ai * bi               (2)
||      MPYH    .M2X    A2,B2,B6        ;** ai+1 * bi+1

        ADD     .L1     A6,A7,A7        ; sum0 += (ai * bi)       (3)
||      ADD     .L2     B6,B7,B7        ; sum1 += (ai+1 * bi+1)
||      MPY     .M1X    A2,B2,A6        ;** ai * bi               (3)
||      MPYH    .M2X    A2,B2,B6        ;** ai+1 * bi+1

        ADD     .L1     A6,A7,A7        ; sum0 += (ai * bi)       (4)
||      ADD     .L2     B6,B7,B7        ; sum1 += (ai+1 * bi+1)
||      MPY     .M1X    A2,B2,A6        ;** ai * bi               (4)
||      MPYH    .M2X    A2,B2,B6        ;** ai+1 * bi+1

        ADD     .L1     A6,A7,A7        ; sum0 += (ai * bi)       (5)
||      ADD     .L2     B6,B7,B7        ; sum1 += (ai+1 * bi+1)
||      MPY     .M1X    A2,B2,A6        ;** ai * bi               (5)
||      MPYH    .M2X    A2,B2,B6        ;** ai+1 * bi+1

        ADD     .L1     A6,A7,A7        ; sum0 += (ai * bi)       (6)
||      ADD     .L2     B6,B7,B7        ; sum1 += (ai+1 * bi+1)

        ADD     .L1     A6,A7,A7        ; sum0 += (ai * bi)       (7)
||      ADD     .L2     B6,B7,B7        ; sum1 += (ai+1 * bi+1)

        ADD     .L1X    A7,B7,A4        ; sum = sum0 + sum1
Example 7-29. Assembly Code for Floating-Point Dot Product (Software Pipelined
With No Extraneous Loads)


        MVK     .S1     41,A1           ; set up loop counter
||      ZERO    .L1     A8              ; sum0 = 0
||      ZERO    .L2     B8              ; sum1 = 0
||      LDDW    .D1     *A4++,A7:A6     ; load ai & ai + 1 from memory
||      LDDW    .D2     *B4++,B7:B6     ; load bi & bi + 1 from memory

        LDDW    .D1     *A4++,A7:A6     ;* load ai & ai + 1 from memory
||      LDDW    .D2     *B4++,B7:B6     ;* load bi & bi + 1 from memory

        LDDW    .D1     *A4++,A7:A6     ;** load ai & ai + 1 from memory
||      LDDW    .D2     *B4++,B7:B6     ;** load bi & bi + 1 from memory

        LDDW    .D1     *A4++,A7:A6     ;*** load ai & ai + 1 from memory
||      LDDW    .D2     *B4++,B7:B6     ;*** load bi & bi + 1 from memory
|| [A1] SUB     .S1     A1,1,A1         ; decrement loop counter

        LDDW    .D1     *A4++,A7:A6     ;**** load ai & ai + 1 from memory
||      LDDW    .D2     *B4++,B7:B6     ;**** load bi & bi + 1 from memory
|| [A1] B       .S2     LOOP            ; branch to loop
|| [A1] SUB     .S1     A1,1,A1         ;* decrement loop counter

        LDDW    .D1     *A4++,A7:A6     ;***** load ai & ai + 1 from memory
||      LDDW    .D2     *B4++,B7:B6     ;***** load bi & bi + 1 from memory
||      MPYSP   .M1X    A6,B6,A5        ; pi = a0 * b0
||      MPYSP   .M2X    A7,B7,B5        ; pi1 = a1 * b1
|| [A1] B       .S2     LOOP            ;* branch to loop
|| [A1] SUB     .S1     A1,1,A1         ;** decrement loop counter

        LDDW    .D1     *A4++,A7:A6     ;****** load ai & ai + 1 from memory
||      LDDW    .D2     *B4++,B7:B6     ;****** load bi & bi + 1 from memory
||      MPYSP   .M1X    A6,B6,A5        ;* pi = a0 * b0
||      MPYSP   .M2X    A7,B7,B5        ;* pi1 = a1 * b1
|| [A1] B       .S2     LOOP            ;** branch to loop
|| [A1] SUB     .S1     A1,1,A1         ;*** decrement loop counter

        LDDW    .D1     *A4++,A7:A6     ;******* load ai & ai + 1 from memory
||      LDDW    .D2     *B4++,B7:B6     ;******* load bi & bi + 1 from memory
||      MPYSP   .M1X    A6,B6,A5        ;** pi = a0 * b0
||      MPYSP   .M2X    A7,B7,B5        ;** pi1 = a1 * b1
|| [A1] B       .S2     LOOP            ;*** branch to loop
|| [A1] SUB     .S1     A1,1,A1         ;**** decrement loop counter

        LDDW    .D1     *A4++,A7:A6     ;******** load ai & ai + 1 from memory
||      LDDW    .D2     *B4++,B7:B6     ;******** load bi & bi + 1 from memory
||      MPYSP   .M1X    A6,B6,A5        ;*** pi = a0 * b0
||      MPYSP   .M2X    A7,B7,B5        ;*** pi1 = a1 * b1
|| [A1] B       .S2     LOOP            ;**** branch to loop
|| [A1] SUB     .S1     A1,1,A1         ;***** decrement loop counter

LOOP:
        LDDW    .D1     *A4++,A7:A6     ;********* load ai & ai + 1 from memory
||      LDDW    .D2     *B4++,B7:B6     ;********* load bi & bi + 1 from memory
||      MPYSP   .M1X    A6,B6,A5        ;**** pi = a0 * b0
||      MPYSP   .M2X    A7,B7,B5        ;**** pi1 = a1 * b1
||      ADDSP   .L1     A5,A8,A8        ; sum0 += (ai * bi)
||      ADDSP   .L2     B5,B8,B8        ; sum1 += (ai+1 * bi+1)
|| [A1] B       .S2     LOOP            ;***** branch to loop
|| [A1] SUB     .S1     A1,1,A1         ;****** decrement loop counter
        ; Branch occurs here
        ; Epilog: (n) numbers the ADDSPs (1-9) and MPYSPs (1-5) outside the loop

        MPYSP   .M1X    A6,B6,A5        ; pi = a0 * b0            (1)
||      MPYSP   .M2X    A7,B7,B5        ; pi1 = a1 * b1
||      ADDSP   .L1     A5,A8,A8        ; sum0 += (ai * bi)       (1)
||      ADDSP   .L2     B5,B8,B8        ; sum1 += (ai+1 * bi+1)

        MPYSP   .M1X    A6,B6,A5        ; pi = a0 * b0            (2)
||      MPYSP   .M2X    A7,B7,B5        ; pi1 = a1 * b1
||      ADDSP   .L1     A5,A8,A8        ; sum0 += (ai * bi)       (2)
||      ADDSP   .L2     B5,B8,B8        ; sum1 += (ai+1 * bi+1)

        MPYSP   .M1X    A6,B6,A5        ; pi = a0 * b0            (3)
||      MPYSP   .M2X    A7,B7,B5        ; pi1 = a1 * b1
||      ADDSP   .L1     A5,A8,A8        ; sum0 += (ai * bi)       (3)
||      ADDSP   .L2     B5,B8,B8        ; sum1 += (ai+1 * bi+1)

        MPYSP   .M1X    A6,B6,A5        ; pi = a0 * b0            (4)
||      MPYSP   .M2X    A7,B7,B5        ; pi1 = a1 * b1
||      ADDSP   .L1     A5,A8,A8        ; sum0 += (ai * bi)       (4)
||      ADDSP   .L2     B5,B8,B8        ; sum1 += (ai+1 * bi+1)

        MPYSP   .M1X    A6,B6,A5        ; pi = a0 * b0            (5)
||      MPYSP   .M2X    A7,B7,B5        ; pi1 = a1 * b1
||      ADDSP   .L1     A5,A8,A8        ; sum0 += (ai * bi)       (5)
||      ADDSP   .L2     B5,B8,B8        ; sum1 += (ai+1 * bi+1)

        ADDSP   .L1     A5,A8,A8        ; sum0 += (ai * bi)       (6)
||      ADDSP   .L2     B5,B8,B8        ; sum1 += (ai+1 * bi+1)

        ADDSP   .L1     A5,A8,A8        ; sum0 += (ai * bi)       (7)
||      ADDSP   .L2     B5,B8,B8        ; sum1 += (ai+1 * bi+1)

        ADDSP   .L1     A5,A8,A8        ; sum0 += (ai * bi)       (8)
||      ADDSP   .L2     B5,B8,B8        ; sum1 += (ai+1 * bi+1)

        ADDSP   .L1     A5,A8,A8        ; sum0 += (ai * bi)       (9)
||      ADDSP   .L2     B5,B8,B8        ; sum1 += (ai+1 * bi+1)

        ADDSP   .L1X    A8,B8,A0        ; sum(0) = sum0(0) + sum1(0)
        ADDSP   .L2X    A8,B8,B0        ; sum(1) = sum0(1) + sum1(1)
        ADDSP   .L1X    A8,B8,A0        ; sum(2) = sum0(2) + sum1(2)
        ADDSP   .L2X    A8,B8,B0        ; sum(3) = sum0(3) + sum1(3)
        NOP                             ; wait for B0
        ADDSP   .L1X    A0,B0,A5        ; sum(01) = sum(0) + sum(1)
        NOP                             ; wait for next B0
        ADDSP   .L2X    A0,B0,B5        ; sum(23) = sum(2) + sum(3)
        NOP     3                       ;
        ADDSP   .L1X    A5,B5,A4        ; sum = sum(01) + sum(23)
        NOP     3                       ;
7.5.3.4 Priming the Loop


Although Example 7-28 and Example 7-29 execute as fast as possible, the code
size can be smaller without significantly sacrificing performance. To help
reduce code size, you can use a technique called priming the loop. Assuming
that you can handle extraneous loads, start with Example 7-26 or
Example 7-27, which do not have epilogs and, therefore, contain fewer
instructions. (This technique can be used equally well with Example 7-28 or
Example 7-29.)


Fixed-Point Example 


To eliminate the prolog of the fixed-point dot product and, therefore, the
extra LDW and MPY instructions, begin execution at the loop body (at the LOOP
label). Eliminating the prolog means that:

- Two LDWs, two MPYs, and two ADDs occur in the first execution cycle of
  the loop.

- Because the first LDWs require five cycles to write results into a
  register, the MPYs do not multiply valid data until after the loop executes
  five times. The ADDs have no valid data until after seven cycles (five
  cycles for the first LDWs and two more cycles for the first valid MPYs).


Example 7-30 shows the loop without the prolog but with four new instructions
that zero the inputs to the MPY and ADD instructions. Making the MPYs and
ADDs use 0s before valid data is available ensures that the final accumulator
values are unaffected. (The loop counter is initialized to 57 to accommodate
the seven extra cycles needed to prime the loop.)


Because the first LDWs are not issued until after seven cycles, the code in
Example 7-30 requires a total of 65 cycles (7 + 57 + 1). Therefore, you are
reducing the code size with a slight loss in performance.
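
The effect of priming can be checked with a small C model of the fixed-point
loop. This is a sketch under stated assumptions rather than device code: a
load is usable five cycles after issue, a multiply result two cycles after
issue, and the function name primed_dotp and the dummy value 0x7fff returned
for extraneous loads are hypothetical.

```c
#include <assert.h>

#define LDW_LATENCY 5               /* LDW result is usable 5 cycles after issue */
#define MPY_LATENCY 2               /* MPY/MPYH result is usable 2 cycles after issue */
#define PRIME_CYCLES (LDW_LATENCY + MPY_LATENCY)

/* Model of the primed single-cycle loop: no prolog, inputs zeroed, and the
   loop run pairs + PRIME_CYCLES times. a and b hold 2 * pairs values. */
int primed_dotp(const short *a, const short *b, int pairs)
{
    short ld_a[LDW_LATENCY][2] = {{0}};   /* loads in flight; zeros model ZERO A2 */
    short ld_b[LDW_LATENCY][2] = {{0}};   /* zeros model ZERO B2 */
    int prod0[MPY_LATENCY] = {0};         /* products in flight; ZERO A6 */
    int prod1[MPY_LATENCY] = {0};         /* ZERO B6 */
    int sum0 = 0, sum1 = 0;               /* ZERO A7, ZERO B7 */

    assert(pairs >= 0);
    for (int c = 0; c < pairs + PRIME_CYCLES; c++) {
        int l = c % LDW_LATENCY;          /* slot filled by the load of cycle c - 5 */
        int m = c % MPY_LATENCY;          /* slot filled by the multiply of cycle c - 2 */
        int valid = c < pairs;            /* later loads are extraneous */

        /* ADD: accumulate the products that are ready on this cycle. */
        sum0 += prod0[m];
        sum1 += prod1[m];

        /* MPY/MPYH: multiply the data that is ready on this cycle. */
        prod0[m] = ld_a[l][0] * ld_b[l][0];
        prod1[m] = ld_a[l][1] * ld_b[l][1];

        /* LDW: extraneous loads get a dummy value instead of reading past
           the arrays; their products are never accumulated. */
        ld_a[l][0] = valid ? a[2 * c]     : 0x7fff;
        ld_a[l][1] = valid ? a[2 * c + 1] : 0x7fff;
        ld_b[l][0] = valid ? b[2 * c]     : 0x7fff;
        ld_b[l][1] = valid ? b[2 * c + 1] : 0x7fff;
    }
    return sum0 + sum1;                   /* ADD .L1X A7,B7,A4 */
}
```

For 50 pairs the loop body runs 57 times, matching the loop counter in the
text. The first seven passes add only zeros, and the data fetched by the seven
extraneous loads never reaches the accumulators.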


Example 7-30. Assembly Code for Fixed-Point Dot Product (Software Pipelined With
Removal of Prolog and Epilog)


        MVK     .S1     57,A1           ; set up loop counter

   [A1] SUB     .S1     A1,1,A1         ; decrement loop counter
||      ZERO    .L1     A7              ; zero out sum0 accumulator
||      ZERO    .L2     B7              ; zero out sum1 accumulator

   [A1] SUB     .S1     A1,1,A1         ;* decrement loop counter
|| [A1] B       .S2     LOOP            ; branch to loop
||      ZERO    .L1     A6              ; zero out add input
||      ZERO    .L2     B6              ; zero out add input

   [A1] SUB     .S1     A1,1,A1         ;** decrement loop counter
|| [A1] B       .S2     LOOP            ;* branch to loop
||      ZERO    .L1     A2              ; zero out mpy input
||      ZERO    .L2     B2              ; zero out mpy input

   [A1] SUB     .S1     A1,1,A1         ;*** decrement loop counter
|| [A1] B       .S2     LOOP            ;** branch to loop

   [A1] SUB     .S1     A1,1,A1         ;**** decrement loop counter
|| [A1] B       .S2     LOOP            ;*** branch to loop

   [A1] SUB     .S1     A1,1,A1         ;***** decrement loop counter
|| [A1] B       .S2     LOOP            ;**** branch to loop

LOOP:
        ADD     .L1     A6,A7,A7        ; sum0 += (ai * bi)
||      ADD     .L2     B6,B7,B7        ; sum1 += (ai+1 * bi+1)
||      MPY     .M1X    A2,B2,A6        ;** ai * bi
||      MPYH    .M2X    A2,B2,B6        ;** ai+1 * bi+1
|| [A1] SUB     .S1     A1,1,A1         ;****** decrement loop counter
|| [A1] B       .S2     LOOP            ;***** branch to loop
||      LDW     .D1     *A4++,A2        ;******* ld ai & ai+1 fm memory
||      LDW     .D2     *B4++,B2        ;******* ld bi & bi+1 fm memory
        ; Branch occurs here
        ADD     .L1X    A7,B7,A4        ; sum = sum0 + sum1
Floating-Point Example


To eliminate the prolog of the floating-point dot product and, therefore, the
extra LDDW and MPYSP instructions, begin execution at the loop body (at the
LOOP label). Eliminating the prolog means that:

- Two LDDWs, two MPYSPs, and two ADDSPs occur in the first execution
  cycle of the loop.

- Because the first LDDWs require five cycles to write results into a
  register, the MPYSPs do not multiply valid data until after the loop
  executes five times. The ADDSPs have no valid data until after nine cycles
  (five cycles for the first LDDWs and four more cycles for the first valid
  MPYSPs).


Example 7-31 shows the loop without the prolog but with four new instructions
that zero the inputs to the MPYSP and ADDSP instructions. Making the MPYSPs
and ADDSPs use 0s before valid data is available ensures that the final
accumulator values are unaffected. (The loop counter is initialized to 59 to
accommodate the nine extra cycles needed to prime the loop.)


Because the first LDDWs are not issued until after seven cycles, the code in
Example 7-31 requires a total of 81 cycles (7 + 59 + 15). Therefore, you are
reducing the code size with a slight loss in performance.


Example 7-31. Assembly Code for Floating-Point Dot Product (Software Pipelined With
Removal of Prolog and Epilog)


        MVK     .S1     59,A1           ; set up loop counter

        ZERO    .L1     A7              ; zero out mpysp input
||      ZERO    .L2     B7              ; zero out mpysp input
|| [A1] SUB     .S1     A1,1,A1         ; decrement loop counter

   [A1] B       .S2     LOOP            ; branch to loop
|| [A1] SUB     .S1     A1,1,A1         ;* decrement loop counter
||      ZERO    .L1     A8              ; zero out sum0 accumulator
||      ZERO    .L2     B8              ; zero out sum1 accumulator

   [A1] B       .S2     LOOP            ;* branch to loop
|| [A1] SUB     .S1     A1,1,A1         ;** decrement loop counter
||      ZERO    .L1     A5              ; zero out addsp input
||      ZERO    .L2     B5              ; zero out addsp input

   [A1] B       .S2     LOOP            ;** branch to loop
|| [A1] SUB     .S1     A1,1,A1         ;*** decrement loop counter
||      ZERO    .L1     A6              ; zero out mpysp input
||      ZERO    .L2     B6              ; zero out mpysp input

   [A1] B       .S2     LOOP            ;*** branch to loop
|| [A1] SUB     .S1     A1,1,A1         ;**** decrement loop counter

   [A1] B       .S2     LOOP            ;**** branch to loop
|| [A1] SUB     .S1     A1,1,A1         ;***** decrement loop counter

LOOP:
        LDDW    .D1     *A4++,A7:A6     ;********* load ai & ai + 1 from memory
||      LDDW    .D2     *B4++,B7:B6     ;********* load bi & bi + 1 from memory
||      MPYSP   .M1X    A6,B6,A5        ;**** pi = a0 * b0
||      MPYSP   .M2X    A7,B7,B5        ;**** pi1 = a1 * b1
||      ADDSP   .L1     A5,A8,A8        ; sum0 += (ai * bi)
||      ADDSP   .L2     B5,B8,B8        ; sum1 += (ai+1 * bi+1)
|| [A1] B       .S2     LOOP            ;***** branch to loop
|| [A1] SUB     .S1     A1,1,A1         ;****** decrement loop counter
        ; Branch occurs here
        ADDSP   .L1X    A8,B8,A0        ; sum(0) = sum0(0) + sum1(0)
        ADDSP   .L2X    A8,B8,B0        ; sum(1) = sum0(1) + sum1(1)
        ADDSP   .L1X    A8,B8,A0        ; sum(2) = sum0(2) + sum1(2)
        ADDSP   .L2X    A8,B8,B0        ; sum(3) = sum0(3) + sum1(3)
        NOP                             ; wait for B0
        ADDSP   .L1X    A0,B0,A5        ; sum(01) = sum(0) + sum(1)
        NOP                             ; wait for next B0
        ADDSP   .L2X    A0,B0,B5        ; sum(23) = sum(2) + sum(3)
        NOP     3                       ;
        ADDSP   .L1X    A5,B5,A4        ; sum = sum(01) + sum(23)
        NOP     3                       ;
7.5.3.5 Removing Extra SUB Instructions


To reduce code size further, you can remove extra SUB instructions. If you
know that the loop count is at least 6, you can eliminate the extra SUB instruc-
tions as shown in Example 7-32 and Example 7-33. The first five branch
instructions are made unconditional, because they always execute. (If you do
not know that the loop count is at least 6, you must keep the SUB instructions
that decrement before each conditional branch as in Example 7-30 and
Example 7-31.) Based on the elimination of six SUB instructions, the loop
counter is now 51 (57 - 6) for the fixed-point dot product and 53 (59 - 6) for
the floating-point dot product. This code saves two cycles relative to
Example 7-30 and Example 7-31. The loop in Example 7-32 requires 63
cycles (5 + 57 + 1) and the loop in Example 7-33 requires 79 cycles
(5 + 59 + 15).


Example 7-32. Assembly Code for Fixed-Point Dot Product (Software Pipelined
With Smallest Code Size)

        B      .S2   LOOP         ; branch to loop
||      MVK    .S1   51,A1        ; set up loop counter

        B      .S2   LOOP         ;* branch to loop

        B      .S2   LOOP         ;** branch to loop
||      ZERO   .L1   A7           ; zero out sum0 accumulator
||      ZERO   .L2   B7           ; zero out sum1 accumulator

        B      .S2   LOOP         ;*** branch to loop
||      ZERO   .L1   A6           ; zero out add input
||      ZERO   .L2   B6           ; zero out add input

        B      .S2   LOOP         ;**** branch to loop
||      ZERO   .L1   A2           ; zero out mpy input
||      ZERO   .L2   B2           ; zero out mpy input

LOOP:
        ADD    .L1   A6,A7,A7     ; sum0 += (ai * bi)
||      ADD    .L2   B6,B7,B7     ; sum1 += (ai+1 * bi+1)
||      MPY    .M1X  A2,B2,A6     ;** ai * bi
||      MPYH   .M2X  A2,B2,B6     ;** ai+1 * bi+1
|| [A1] SUB    .S1   A1,1,A1      ;****** decrement loop counter
|| [A1] B      .S2   LOOP         ;***** branch to loop
||      LDW    .D1   *A4++,A2     ;******* ld ai & ai+1 fm memory
||      LDW    .D2   *B4++,B2     ;******* ld bi & bi+1 fm memory
        ; Branch occurs here

        ADD    .L1X  A7,B7,A4     ; sum = sum0 + sum1
Example 7-33. Assembly Code for Floating-Point Dot Product (Software Pipelined
With Smallest Code Size)

        B      .S2   LOOP         ; branch to loop
||      MVK    .S1   53,A1        ; set up loop counter

        B      .S2   LOOP         ;* branch to loop
||      ZERO   .L1   A7           ; zero out mpysp input
||      ZERO   .L2   B7           ; zero out mpysp input

        B      .S2   LOOP         ;** branch to loop
||      ZERO   .L1   A8           ; zero out sum0 accumulator
||      ZERO   .L2   B8           ; zero out sum1 accumulator

        B      .S2   LOOP         ;*** branch to loop
||      ZERO   .L1   A5           ; zero out addsp input
||      ZERO   .L2   B5           ; zero out addsp input

        B      .S2   LOOP         ;**** branch to loop
||      ZERO   .L1   A6           ; zero out mpysp input
||      ZERO   .L2   B6           ; zero out mpysp input

LOOP:
        LDDW   .D1   *A4++,A7:A6  ;********* load ai & ai + 1 from memory
||      LDDW   .D2   *B4++,B7:B6  ;********* load bi & bi + 1 from memory
||      MPYSP  .M1X  A6,B6,A5     ;**** pi = a0 b0
||      MPYSP  .M2X  A7,B7,B5     ;**** pi1 = a1 b1
||      ADDSP  .L1   A5,A8,A8     ; sum0 += (ai bi)
||      ADDSP  .L2   B5,B8,B8     ; sum1 += (ai+1 bi+1)
|| [A1] B      .S2   LOOP         ;***** branch to loop
|| [A1] SUB    .S1   A1,1,A1      ;****** decrement loop counter
        ; Branch occurs here

        ADDSP  .L1X  A8,B8,A0     ; sum(0) = sum0(0) + sum1(0)
        ADDSP  .L2X  A8,B8,B0     ; sum(1) = sum0(1) + sum1(1)
        ADDSP  .L1X  A8,B8,A0     ; sum(2) = sum0(2) + sum1(2)
        ADDSP  .L2X  A8,B8,B0     ; sum(3) = sum0(3) + sum1(3)
        NOP                       ; wait for B0
        ADDSP  .L1X  A0,B0,A5     ; sum(01) = sum(0) + sum(1)
        NOP                       ; wait for next B0
        ADDSP  .L2X  A0,B0,B5     ; sum(23) = sum(2) + sum(3)
        NOP    3
        ADDSP  .L1X  A5,B5,A4     ; sum = sum(01) + sum(23)
        NOP    3
7.5.4 Comparing Performance


Table 7-10 compares the performance of all versions of the fixed-point dot
product code. Table 7-11 compares the performance of all versions of the
floating-point dot product code.


Table 7-10. Comparison of Fixed-Point Dot Product Code Examples

Code Example                                                                 100 Iterations     Cycle Count
Example 7-9   Fixed-point dot product linear assembly                        2 + 100 x 16       1602
Example 7-10  Fixed-point dot product parallel assembly                      1 + 100 x 8        801
Example 7-19  Fixed-point dot product parallel assembly with LDW             1 + (50 x 8) + 1   402
Example 7-26  Fixed-point software-pipelined dot product                     7 + 50 + 1         58
Example 7-28  Fixed-point software-pipelined dot product with no             7 + 43 + 7 + 1     58
              extraneous loads
Example 7-30  Fixed-point software-pipelined dot product with no prolog      7 + 57 + 1         65
              or epilog
Example 7-32  Fixed-point software-pipelined dot product with smallest       5 + 57 + 1         63
              code size


Table 7-11. Comparison of Floating-Point Dot Product Code Examples

Code Example                                                                 100 Iterations     Cycle Count
Example 7-11  Floating-point dot product nonparallel assembly                2 + 100 x 21       2102
Example 7-12  Floating-point dot product parallel assembly                   1 + 100 x 10       1001
Example 7-20  Floating-point dot product parallel assembly with LDDW         1 + (50 x 10) + 7  508
Example 7-27  Floating-point software-pipelined dot product                  9 + 50 + 15        74
Example 7-29  Floating-point software-pipelined dot product with no          9 + 41 + 9 + 15    74
              extraneous loads
Example 7-31  Floating-point software-pipelined dot product with no prolog   7 + 59 + 15        81
              or epilog
Example 7-33  Floating-point software-pipelined dot product with smallest    5 + 59 + 15        79
              code size
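
The cycle counts in both tables follow directly from the expressions in the "100 Iterations" column. As a minimal sketch (the function names are illustrative, not part of any TI tool), the arithmetic can be checked in C:

```c
/* Cycle count for a software-pipelined version:
 * setup (or prolog) cycles + loop cycles + cycles after the loop. */
int pipelined_cycles(int setup, int loop, int after)
{
    return setup + loop + after;
}

/* Cycle count for a non-pipelined loop:
 * fixed overhead + iterations * cycles per iteration. */
int plain_loop_cycles(int overhead, int iterations, int cycles_per_iter)
{
    return overhead + iterations * cycles_per_iter;
}
```

For example, pipelined_cycles(5, 57, 1) gives the 63 cycles listed for Example 7-32, and plain_loop_cycles(2, 100, 21) gives the 2102 cycles listed for Example 7-11.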




7.6 Modulo Scheduling of Multicycle Loops 


Section 7.5 demonstrated the modulo-scheduling technique for the dot 
product code. In that example of a single-cycle loop, none of the instructions 
used the same resources. Multicycle loops can present resource conflicts 
which affect modulo scheduling. This section describes techniques to deal 
with this issue. 


7.6.1 Weighted Vector Sum C Code 


Example 7-34 shows the C code for a weighted vector sum. 


Example 7-34. Weighted Vector Sum C Code 


void w_vec(short a[],short b[],short c[],short m)
{
    int i;

    for (i=0; i<100; i++) {
        c[i] = ((m * a[i]) >> 15) + b[i];
    }
}

7.6.2 Translating C Code to Linear Assembly 


Example 7-35 shows the linear assembly that executes the weighted vector
sum in Example 7-34. This linear assembly does not have functional units
assigned. The dependency graph will help in those decisions. However, before
looking at the dependency graph, the code can be optimized further.


Example 7-35. Linear Assembly for Weighted Vector Sum Inner Loop

        LDH    *aptr++,ai         ; ai
        LDH    *bptr++,bi         ; bi
        MPY    m,ai,pi            ; m * ai
        SHR    pi,15,pi_scaled    ; (m * ai) >> 15
        ADD    pi_scaled,bi,ci    ; ci = (m * ai) >> 15 + bi
        STH    ci,*cptr++         ; store ci
 [cntr] SUB    cntr,1,cntr        ; decrement loop counter
 [cntr] B      LOOP               ; branch to loop
7.6.3 Determining the Minimum Iteration Interval


Example 7-35 includes three memory operations in the inner loop (two LDHs
and the STH) that must each use a .D unit. Only two .D units are available on
any single cycle; therefore, this loop requires at least two cycles. Because no
other resource is used more than twice, the minimum iteration interval for this
loop is 2.


Memory operations determine the minimum iteration interval in this example. 
Therefore, before scheduling this assembly code, unroll the loop and perform 
LDWs to help improve the performance. 


7.6.3.1 Unrolling the Weighted Vector Sum C Code 


Example 7-36 shows the C code for an unrolled version of the weighted vector
sum.


Example 7-36. Weighted Vector Sum C Code (Unrolled)


void w_vec(short a[],short b[],short c[],short m)
{
    int i;

    for (i=0; i<100; i+=2) {
        c[i]   = ((m * a[i]) >> 15) + b[i];
        c[i+1] = ((m * a[i+1]) >> 15) + b[i+1];
    }
}
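
One way to gain confidence in an unrolled loop before translating it to linear assembly is to compare it against the original loop on the same data. The following is a minimal host-side sketch; the function names and the test pattern are assumptions made for illustration, and both versions rely on the compiler's handling of right shifts of negative values in the same way.

```c
#include <string.h>

#define N 100

/* One element per iteration, as in the original weighted vector sum. */
void w_vec_ref(const short a[], const short b[], short c[], short m)
{
    for (int i = 0; i < N; i++)
        c[i] = (short)(((m * a[i]) >> 15) + b[i]);
}

/* Two elements per iteration, as in the unrolled version. */
void w_vec_unrolled(const short a[], const short b[], short c[], short m)
{
    for (int i = 0; i < N; i += 2) {
        c[i]     = (short)(((m * a[i]) >> 15) + b[i]);
        c[i + 1] = (short)(((m * a[i + 1]) >> 15) + b[i + 1]);
    }
}

/* Returns 1 when both versions produce identical output for an
 * arbitrary (hypothetical) test pattern, 0 otherwise. */
int w_vec_versions_agree(short m)
{
    short a[N], b[N], out_ref[N], out_unrolled[N];

    for (int i = 0; i < N; i++) {
        a[i] = (short)(i * 37 - 1000);
        b[i] = (short)(500 - i * 11);
    }
    w_vec_ref(a, b, out_ref, m);
    w_vec_unrolled(a, b, out_unrolled, m);
    return memcmp(out_ref, out_unrolled, sizeof out_ref) == 0;
}
```

Unrolling by 2 is only valid here because the trip count (100) is even; an odd trip count would need a cleanup iteration.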
7.6.3.2 Translating Unrolled Inner Loop to Linear Assembly


Example 7-37 shows the linear assembly that calculates c[i] and c[i+1] for the
weighted vector sum in Example 7-36.


- The two store pointers (*ciptr and *ci+1ptr) are separated so that one
  (*ciptr) increments by 2 through the even-indexed elements of the array
  (c[0], c[2], ...) and the other (*ci+1ptr) increments by 2 through the
  odd-indexed elements (c[1], c[3], ...).

- AND and SHR separate bi and bi+1 into two separate registers.

- This code assumes that mask is preloaded with 0x0000FFFF to clear the
  upper 16 bits. The shift right of 16 places bi+1 into the 16 LSBs.
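
The mask-and-shift step can be modeled in C. This is a minimal sketch with illustrative function names; it uses unsigned arithmetic, so the upper half is zero-filled rather than sign-extended. That difference does not matter for values that are subsequently stored as halfwords, because only the 16 LSBs are kept.

```c
#include <stdint.h>

/* bi: keep the lower halfword, as the AND with mask 0x0000FFFF does. */
uint16_t low_half(uint32_t word)
{
    return (uint16_t)(word & 0x0000FFFFu);
}

/* bi+1: move the upper halfword into the 16 LSBs, as the shift right
 * by 16 does. */
uint16_t high_half(uint32_t word)
{
    return (uint16_t)(word >> 16);
}
```

For a loaded word of 0xABCD1234, low_half yields 0x1234 (bi) and high_half yields 0xABCD (bi+1).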


Example 7-37. Linear Assembly for Weighted Vector Sum Using LDW

        LDW    *aptr++,ai_i+1          ; ai & ai+1
        LDW    *bptr++,bi_i+1          ; bi & bi+1
        MPY    m,ai_i+1,pi             ; m * ai
        MPYHL  m,ai_i+1,pi+1           ; m * ai+1
        SHR    pi,15,pi_scaled         ; (m * ai) >> 15
        SHR    pi+1,15,pi+1_scaled     ; (m * ai+1) >> 15
        AND    bi_i+1,mask,bi          ; bi
        SHR    bi_i+1,16,bi+1          ; bi+1
        ADD    pi_scaled,bi,ci         ; ci = (m * ai) >> 15 + bi
        ADD    pi+1_scaled,bi+1,ci+1   ; ci+1 = (m * ai+1) >> 15 + bi+1
        STH    ci,*ciptr++[2]          ; store ci
        STH    ci+1,*ci+1ptr++[2]      ; store ci+1
 [cntr] SUB    cntr,1,cntr             ; decrement loop counter
 [cntr] B      LOOP                    ; branch to loop


7.6.3.3 Determining a New Minimum Iteration Interval 


Use the following considerations to determine the minimum iteration interval 
for the assembly instructions in Example 7-37: 


- Four memory operations (two LDWs and two STHs) must each use a .D
  unit. With two .D units available, this loop still requires only two cycles.

- Four instructions must use the .S units (three SHRs and one branch). With
  two .S units available, the minimum iteration interval is still 2.

- The two MPYs do not increase the minimum iteration interval.

- Because the remaining four instructions (two ADDs, AND, and SUB) can
  all use a .L unit, the minimum iteration interval for this loop is the same as
  in Example 7-35.


By using LDWs instead of LDHs, the program can do twice as much work in 
the same number of cycles. 
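
The resource bound used above is the largest value of ceil(operations / units) over the unit types. A minimal sketch in C, under the simplifying assumption that each instruction is pinned to one unit type (.D, .S, .M, or .L) and that two units of each type are available per cycle; the function name is illustrative:

```c
/* Two functional units of each type (.L, .S, .M, .D) per cycle. */
enum { UNITS_PER_TYPE = 2 };

static int ceil_div(int a, int b)
{
    return (a + b - 1) / b;
}

/* Resource-bound minimum iteration interval from per-type operation
 * counts. Simplification: instructions that could run on several unit
 * types (for example ADD or SUB) are counted against one type only. */
int resource_min_ii(int d_ops, int s_ops, int m_ops, int l_ops)
{
    int counts[4] = { d_ops, s_ops, m_ops, l_ops };
    int ii = 1;

    for (int i = 0; i < 4; i++) {
        int need = ceil_div(counts[i], UNITS_PER_TYPE);
        if (need > ii)
            ii = need;
    }
    return ii;
}
```

With the counts for the LDH version (3 .D, 2 .S, 1 .M, 2 .L) and for the LDW version (4 .D, 4 .S, 2 .M, 4 .L), the function returns 2 in both cases, while the LDW version produces two results per iteration.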
7.6.4 Drawing a Dependency Graph


To achieve a minimum iteration interval of 2, you must put an equal number
of operations per unit on each side of the dependency graph. Three operations
in one unit on a side would result in a minimum iteration interval of 3.


Figure 7-11 shows the dependency graph divided evenly with a minimum
iteration interval of 2.


Figure 7-11. Dependency Graph of Weighted Vector Sum 
7.6.5 Linear Assembly Resource Allocation

Using the dependency graph, you can allocate functional units and registers
as shown in Example 7-38. This code is based on the following assumptions:


- The pointers are initialized outside the loop.
- m resides in B6, which causes both .M units to use a cross path.
- The mask in the AND instruction resides in B10.


Example 7-38. Linear Assembly for Weighted Vector Sum With Resources Allocated

        LDW    .D1   *A4++,A2      ; ai & ai+1
        LDW    .D2   *B4++,B2      ; bi & bi+1
        MPY    .M1X  A2,B6,A5      ; pi = m * ai
        MPYHL  .M2X  A2,B6,B5      ; pi+1 = m * ai+1
        SHR    .S1   A5,15,A7      ; pi_scaled = (m * ai) >> 15
        SHR    .S2   B5,15,B7      ; pi+1_scaled = (m * ai+1) >> 15
        AND    .L2   B2,B10,B8     ; bi
        SHR    .S2   B2,16,B1      ; bi+1
        ADD    .L1X  A7,B8,A9      ; ci = (m * ai) >> 15 + bi
        ADD    .L2   B7,B1,B9      ; ci+1 = (m * ai+1) >> 15 + bi+1
        STH    .D1   A9,*A6++[2]   ; store ci
        STH    .D2   B9,*B0++[2]   ; store ci+1
   [A1] SUB    .L1   A1,1,A1       ; decrement loop counter
   [A1] B      .S1   LOOP          ; branch to loop


7.6.6 Modulo Iteration Interval Scheduling 


Table 7-12 provides a method to keep track of resources that are a modulo 
iteration interval away from each other. In the single-cycle dot product exam- 
ple, every instruction executed every cycle and, therefore, required only one 
set of resources. Table 7-12 includes two groups of resources, which are 
necessary because you are scheduling a two-cycle loop. 


- Instructions that execute on cycle k also execute on cycle k + 2, k + 4, etc.
  Instructions scheduled on these even cycles cannot use the same
  resources.

- Instructions that execute on cycle k + 1 also execute on cycle k + 3, k + 5,
  etc. Instructions scheduled on these odd cycles cannot use the same
  resources.

- Because two instructions (MPY and ADD) use the 1X path but do not use
  the same functional unit, Table 7-12 includes two rows (1X and 2X) that
  help you keep track of the cross path resources.
Only seven instructions have been scheduled in this table.


- The two LDWs use the .D units on the even cycles.

- The MPY and MPYHL are scheduled on cycle 5 because the LDW has four
  delay slots. The MPY instructions appear in two rows because they use
  the .M and cross path resources on cycles 5, 7, 9, etc.

- The two SHR instructions are scheduled two cycles after the MPY to allow
  for the MPY's single delay slot.

- The AND is scheduled on cycle 5, four delay slots after the LDW.


Table 7-12. Modulo Iteration Interval Table for Weighted Vector Sum (2-Cycle Loop)

Unit/Cycle  0           2             4              6               8                10
.D1         LDW ai_i+1  * LDW ai_i+1  ** LDW ai_i+1  *** LDW ai_i+1  **** LDW ai_i+1  ***** LDW ai_i+1
.D2         LDW bi_i+1  * LDW bi_i+1  ** LDW bi_i+1  *** LDW bi_i+1  **** LDW bi_i+1  ***** LDW bi_i+1
.M1
.M2
.L1
.L2
.S1
.S2
1X
2X

Unit/Cycle  1           3             5              7               9                11
.D1
.D2
.M1                                   MPY pi         * MPY pi        ** MPY pi        *** MPY pi
.M2                                   MPYHL pi+1     * MPYHL pi+1    ** MPYHL pi+1    *** MPYHL pi+1
.L1                                   AND bi         * AND bi        ** AND bi        *** AND bi
.L2
.S1                                                  SHR pi_s        * SHR pi_s       ** SHR pi_s
.S2                                                  SHR pi+1_s      * SHR pi+1_s     ** SHR pi+1_s
1X                                    MPY pi         * MPY pi        ** MPY pi        *** MPY pi
2X                                    MPYHL pi+1     * MPYHL pi+1    ** MPYHL pi+1    *** MPYHL pi+1

Note: The asterisks indicate the iteration of the loop; shaded cells indicate cycle 0.
7.6.6.1 Resource Conflicts


Resources from one instruction cannot conflict with resources from any other
instruction scheduled modulo iteration intervals away. In other words, for a
2-cycle loop, instructions scheduled on cycle n cannot use the same resources
as instructions scheduled on cycles n + 2, n + 4, n + 6, etc. Table 7-13 shows
the addition of the SHR bi+1 instruction. This must avoid a conflict of resources
in cycles 5 and 7, which are one iteration interval away from each other.


Even though LDW bi_i+1 (.D2, cycle 0) finishes on cycle 5, its child, SHR bi+1,
cannot be scheduled on .S2 until cycle 6 because of a resource conflict with
SHR pi+1_scaled, which is on .S2 in cycle 7.
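
The bookkeeping described here can be modeled as a small reservation table indexed by cycle modulo the iteration interval. The following is a minimal sketch; the type and function names are illustrative and are not part of the assembly optimizer.

```c
#include <string.h>

#define MAX_II  8     /* largest iteration interval this sketch supports */
#define NUM_RES 10    /* .D1 .D2 .M1 .M2 .L1 .L2 .S1 .S2 1X 2X */

enum resource { D1, D2, M1, M2, L1, L2, S1, S2, X1, X2 };

struct mrt {
    int ii;                                 /* iteration interval */
    unsigned char used[MAX_II][NUM_RES];    /* 1 = resource taken in slot */
};

/* Start with an empty table; ii must be between 1 and MAX_II. */
void mrt_init(struct mrt *t, int ii)
{
    memset(t, 0, sizeof *t);
    t->ii = ii;
}

/* Try to reserve resource r for an instruction scheduled on `cycle`.
 * Returns 1 on success, or 0 if an instruction a multiple of ii cycles
 * away already holds the same resource. */
int mrt_reserve(struct mrt *t, int cycle, enum resource r)
{
    int slot = cycle % t->ii;

    if (t->used[slot][r])
        return 0;
    t->used[slot][r] = 1;
    return 1;
}
```

With an iteration interval of 2, reserving .S2 on cycle 7 for SHR pi+1_scaled makes a later request for .S2 on cycle 5 fail, whereas a request on cycle 6 succeeds, which matches the placement of SHR bi+1.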


Figure 7-12. Dependency Graph of Weighted Vector Sum (Showing Resource Conflict) 


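[Figure not reproduced: dependency graph with A side and B side columns
showing the LDW, MPY, MPYHL, and SHR nodes, with annotations "Scheduled on
cycle 7" and "Scheduled on cycle 5".]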
Table 7-13. Modulo Iteration Interval Table for Weighted Vector Sum With SHR
Instructions

Unit/Cycle  0           2             4              6               8                10, 12, 14, ...
.D1         LDW ai_i+1  * LDW ai_i+1  ** LDW ai_i+1  *** LDW ai_i+1  **** LDW ai_i+1  ***** LDW ai_i+1
.D2         LDW bi_i+1  * LDW bi_i+1  ** LDW bi_i+1  *** LDW bi_i+1  **** LDW bi_i+1  ***** LDW bi_i+1
.M1
.M2
.L1
.L2
.S1
.S2                                                  SHR bi+1        * SHR bi+1       ** SHR bi+1
1X
2X

Unit/Cycle  1           3             5              7               9                11, 13, 15, ...
.D1
.D2
.M1                                   MPY pi         * MPY pi        ** MPY pi        *** MPY pi
.M2                                   MPYHL pi+1     * MPYHL pi+1    ** MPYHL pi+1    *** MPYHL pi+1
.L1                                   AND bi         * AND bi        ** AND bi        *** AND bi
.L2
.S1                                                  SHR pi_s        * SHR pi_s       ** SHR pi_s
.S2                                                  SHR pi+1_s      * SHR pi+1_s     ** SHR pi+1_s
1X                                    MPY pi         * MPY pi        ** MPY pi        *** MPY pi
2X                                    MPYHL pi+1     * MPYHL pi+1    ** MPYHL pi+1    *** MPYHL pi+1

Note: The asterisks indicate the iteration of the loop; shading indicates changes in scheduling from Table 7-12.
7.6.6.2 Live Too Long


Scheduling SHR bi+1 on cycle 6 now creates a problem with scheduling the
ADD ci instruction. The parents of ADD ci (AND bi and SHR pi_scaled) are
scheduled on cycles 5 and 7, respectively. Because the SHR pi_scaled is
scheduled on cycle 7, the earliest you can schedule ADD ci is cycle 8.


However, in cycle 7, the next-iteration instance of AND bi (marked * in the
table) writes bi for the next iteration of the loop, which creates a scheduling
problem with the ADD ci instruction. If you schedule ADD ci on cycle 8, the
ADD instruction reads the parent value of bi for the next iteration, which is
incorrect. The ADD ci demonstrates a live-too-long problem.


No value can be live in a register for more than the number of cycles in the loop.
Otherwise, iteration n + 1 writes into the register before iteration n has read that
register. Therefore, in a 2-cycle loop, if a value is written to a register at the end
of cycle n, then all children of that value must read the register before the end
of cycle n + 2.
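
The rule above can be expressed as a simple check. This is a minimal sketch with an illustrative function name; it assumes the result is written at the end of the cycle given by the issue cycle plus the instruction's delay slots.

```c
/* Returns 1 if a value lives too long for a loop with iteration
 * interval `ii`: the value is written at the end of cycle
 * (issue_cycle + delay_slots) and must be read by every child no later
 * than `ii` cycles after that. */
int lives_too_long(int issue_cycle, int delay_slots, int last_read_cycle,
                   int ii)
{
    int write_cycle = issue_cycle + delay_slots;

    return last_read_cycle > write_cycle + ii;
}
```

For AND bi issued on cycle 5 (no delay slots) and read by ADD ci on cycle 8 in a 2-cycle loop, the check reports a problem; issuing AND bi on cycle 6 instead removes it.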


7.6.6.3 Solving the Live-Too-Long Problem 


The live-too-long problem in Table 7-13 means that the bi value would have 
to be live from cycles 6-8, or 3 cycles. No loop variable can live longer than 
the iteration interval, because a child would then read the parent value for the 
next iteration. 


To solve this problem move AND bi to cycle 6 so that you can schedule ADD ci 
to read the correct value on cycle 8, as shown in Figure 7-13 and Table 7-14. 
Figure 7-13. Dependency Graph of Weighted Vector Sum (With Resource Conflict
Resolved)

Table 7-14. Modulo Iteration Interval Table for Weighted Vector Sum (2-Cycle Loop)

Unit/Cycle  0           2             4              6               8                10
.D1         LDW ai_i+1  * LDW ai_i+1  ** LDW ai_i+1  *** LDW ai_i+1  **** LDW ai_i+1  ***** LDW ai_i+1
.D2         LDW bi_i+1  * LDW bi_i+1  ** LDW bi_i+1  *** LDW bi_i+1  **** LDW bi_i+1  ***** LDW bi_i+1
.M1
.M2
.L1                                                                  ADD ci           * ADD ci
.L2                                                  AND bi          * AND bi         ** AND bi
.S1
.S2                                                  SHR bi+1        * SHR bi+1       ** SHR bi+1
1X
2X

Unit/Cycle  1           3             5              7               9                11
.D1
.D2
.M1                                   MPY pi         * MPY pi        ** MPY pi        *** MPY pi
.M2                                   MPYHL pi+1     * MPYHL pi+1    ** MPYHL pi+1    *** MPYHL pi+1
.L1
.L2
.S1                                                  SHR pi_s        * SHR pi_s       ** SHR pi_s
.S2                                                  SHR pi+1_s      * SHR pi+1_s     ** SHR pi+1_s
1X                                    MPY pi         * MPY pi        ** MPY pi        *** MPY pi
2X                                    MPYHL pi+1     * MPYHL pi+1    ** MPYHL pi+1    *** MPYHL pi+1

Note: The asterisks indicate the iteration of the loop; shading indicates changes in scheduling from Table 7-13.
7.6.6.4 Scheduling the Remaining Instructions

Figure 7-14 shows the dependency graph with additional scheduling
changes. The final version of the loop, with all instructions scheduled correctly,
is shown in Table 7-15.


Figure 7-14. Dependency Graph of Weighted Vector Sum (Scheduling ci+1)

Note: Shaded numbers indicate the cycle in which the instruction is first scheduled.

Table 7-15 shows the following additions:

- B LOOP (.S1, cycle 6)
- SUB cntr (.L1, cycle 5)
- ADD ci+1 (.L2, cycle 10)
- STH ci (cycle 9)
- STH ci+1 (cycle 11)

To avoid resource conflicts and live-too-long problems, Table 7-15 also
includes the following additional changes:

- LDW bi_i+1 (.D2) moved from cycle 0 to cycle 2.
- AND bi (.L2) moved from cycle 6 to cycle 7.
- SHR pi+1_scaled (.S2) moved from cycle 7 to cycle 9.
- MPYHL pi+1 moved from cycle 5 to cycle 6.
- SHR bi+1 moved from cycle 6 to cycle 8.


From the table, you can see that this loop is pipelined six iterations deep,
because iterations n and n + 5 execute in parallel.
Table 7-15. Modulo Iteration Interval Table for Weighted Vector Sum (2-Cycle Loop)

Unit/Cycle  0           2             4              6               8                10, 12, 14, ...
.D1         LDW ai_i+1  * LDW ai_i+1  ** LDW ai_i+1  *** LDW ai_i+1  **** LDW ai_i+1  ***** LDW ai_i+1
.D2                     LDW bi_i+1    * LDW bi_i+1   ** LDW bi_i+1   *** LDW bi_i+1   **** LDW bi_i+1
.M1
.M2                                                  MPYHL pi+1      * MPYHL pi+1     ** MPYHL pi+1
.L1                                                                  ADD ci           * ADD ci
.L2                                                                                   ADD ci+1
.S1                                                  B LOOP          * B LOOP         ** B LOOP
.S2                                                                  SHR bi+1         * SHR bi+1
1X                                                                   ADD ci           * ADD ci
2X                                                   MPYHL pi+1      * MPYHL pi+1     ** MPYHL pi+1

Unit/Cycle  1           3             5              7               9                11, 13, 15, ...
.D1                                                                  STH ci           * STH ci
.D2                                                                                   STH ci+1
.M1                                   MPY pi         * MPY pi        ** MPY pi        *** MPY pi
.M2
.L1                                   SUB cntr       * SUB cntr      ** SUB cntr      *** SUB cntr
.L2                                                  AND bi          * AND bi         ** AND bi
.S1                                                  SHR pi_s        * SHR pi_s       ** SHR pi_s
.S2                                                                  SHR pi+1_s       * SHR pi+1_s
1X                                    MPY pi         * MPY pi        ** MPY pi        *** MPY pi
2X

Note: The asterisks indicate the iteration of the loop; shading indicates changes in scheduling from Table 7-14.
7.6.7 Using the Assembly Optimizer for the Weighted Vector Sum


Example 7-39 shows the linear assembly code to perform the weighted vector
sum. You can use this code as input to the assembly optimizer to create a
software-pipelined loop instead of scheduling this by hand.


Example 7-39. Linear Assembly for Weighted Vector Sum

        .global _w_vec
_w_vec: .cproc a, b, c, m
        .reg   ai_i1, bi_i1, pi, pi1, pi_i1, pi_s, pi1_s
        .reg   mask, bi, bi1, ci, ci1, c1, cntr

        MVK    -1,mask               ; set to all 1s to create 0xFFFFFFFF
        MVKH   0,mask                ; clear upper 16 bits to create 0xFFFF
        MVK    50,cntr               ; cntr = 100/2
        ADD    2,c,c1                ; point to c[1]
LOOP:   .trip 50
        LDW    .D1   *a++,ai_i1      ; ai & ai+1
        LDW    .D2   *b++,bi_i1      ; bi & bi+1
        MPY    .M1X  ai_i1,m,pi      ; m * ai
        MPYHL  .M2X  ai_i1,m,pi1     ; m * ai+1
        SHR    .S1   pi,15,pi_s      ; (m * ai) >> 15
        SHR    .S2   pi1,15,pi1_s    ; (m * ai+1) >> 15
        AND    .L2   bi_i1,mask,bi   ; bi
        SHR    .S2   bi_i1,16,bi1    ; bi+1
        ADD    .L1X  pi_s,bi,ci      ; ci = (m * ai) >> 15 + bi
        ADD    .L2   pi1_s,bi1,ci1   ; ci+1 = (m * ai+1) >> 15 + bi+1
        STH    .D1   ci,*c++[2]      ; store ci
        STH    .D2   ci1,*c1++[2]    ; store ci+1
 [cntr] SUB    .L1   cntr,1,cntr     ; decrement loop counter
 [cntr] B      .S1   LOOP            ; branch to loop
        .endproc
7.6.8 Final Assembly


Example 7-40 shows the final assembly code for the weighted vector sum.
The following optimizations are included:


- While iteration n of instruction STH ci+1 is executing, iteration n + 1 of
  STH ci is executing. To prevent the STH ci instruction from executing
  iteration 51 while STH ci+1 executes iteration 50, execute the loop only 49
  times and schedule the final executions of ADD ci+1 and STH ci+1 after
  exiting the loop.

- The mask for the AND instruction is created with MVK and MVKH in
  parallel with the loop prolog.

- The pointer to the odd elements in array c is also set up in parallel with the
  loop prolog.
Example 7-40. Assembly Code for Weighted Vector Sum

        LDW    .D1   *A4++,A2      ; ai & ai+1

        ADD    .L2X  A6,2,B0       ; set pointer to ci+1

        LDW    .D2   *B4++,B2      ; bi & bi+1
||      LDW    .D1   *A4++,A2      ;* ai & ai+1

        MVK    .S2   -1,B10        ; set to all 1s (0xFFFFFFFF)

        LDW    .D2   *B4++,B2      ;* bi & bi+1
||      LDW    .D1   *A4++,A2      ;** ai & ai+1
||      MVK    .S1   49,A1         ; set up loop counter
||      MVKH   .S2   0,B10         ; clr upper 16 bits (0x0000FFFF)

        MPY    .M1X  A2,B6,A5      ; m * ai
|| [A1] SUB    .L1   A1,1,A1       ; decrement loop counter

        MPYHL  .M2X  A2,B6,B5      ; m * ai+1
|| [A1] B      .S1   LOOP          ; branch to loop
||      LDW    .D2   *B4++,B2      ;** bi & bi+1
||      LDW    .D1   *A4++,A2      ;*** ai & ai+1

        SHR    .S1   A5,15,A7      ; (m * ai) >> 15
||      AND    .L2   B2,B10,B8     ; bi
||      MPY    .M1X  A2,B6,A5      ;* m * ai
|| [A1] SUB    .L1   A1,1,A1       ;* decrement loop counter

        SHR    .S2   B2,16,B1      ; bi+1
||      ADD    .L1X  A7,B8,A9      ; ci = (m * ai) >> 15 + bi
||      MPYHL  .M2X  A2,B6,B5      ;* m * ai+1
|| [A1] B      .S1   LOOP          ;* branch to loop
||      LDW    .D2   *B4++,B2      ;*** bi & bi+1
||      LDW    .D1   *A4++,A2      ;**** ai & ai+1

        SHR    .S2   B5,15,B7      ; (m * ai+1) >> 15
||      STH    .D1   A9,*A6++[2]   ; store ci
||      SHR    .S1   A5,15,A7      ;* (m * ai) >> 15
||      AND    .L2   B2,B10,B8     ;* bi
|| [A1] SUB    .L1   A1,1,A1       ;** decrement loop counter
||      MPY    .M1X  A2,B6,A5      ;** m * ai

LOOP:
        ADD    .L2   B7,B1,B9      ; ci+1 = (m * ai+1) >> 15 + bi+1
||      SHR    .S2   B2,16,B1      ;* bi+1
||      ADD    .L1X  A7,B8,A9      ;* ci = (m * ai) >> 15 + bi
||      MPYHL  .M2X  A2,B6,B5      ;** m * ai+1
|| [A1] B      .S1   LOOP          ;** branch to loop
||      LDW    .D2   *B4++,B2      ;**** bi & bi+1
||      LDW    .D1   *A4++,A2      ;***** ai & ai+1

        STH    .D2   B9,*B0++[2]   ; store ci+1
||      SHR    .S2   B5,15,B7      ;* (m * ai+1) >> 15
||      STH    .D1   A9,*A6++[2]   ;* store ci
||      SHR    .S1   A5,15,A7      ;** (m * ai) >> 15
||      AND    .L2   B2,B10,B8     ;** bi
|| [A1] SUB    .L1   A1,1,A1       ;*** decrement loop counter
||      MPY    .M1X  A2,B6,A5      ;*** m * ai
        ; Branch occurs here

        ADD    .L2   B7,B1,B9      ; ci+1 = (m * ai+1) >> 15 + bi+1

        STH    .D2   B9,*B0        ; store ci+1
7.7 Loop Carry Paths


Loop carry paths occur when one iteration of a loop writes a value that must 
be read by a future iteration. A loop carry path can affect the performance of 
a software-pipelined loop that executes multiple iterations in parallel. Some- 
times loop carry paths (instead of resources) determine the minimum iteration 
interval. 


IIR filter code contains a loop carry path; output samples are used as input to 
the computation of the next output sample. 


7.7.1 IIR Filter C Code 


Example 7-41 shows C code for a simple IIR filter. In this example, y[i] is an 
input to the calculation of y[i+1]. Before y[i] can be read for the next iteration, 
y[i+1] must be computed from the previous iteration. 


Example 7-41. IIR Filter C Code 


void iir(short x[],short y[],short c1, short c2, short c3)
{
    int i;

    for (i=0; i<100; i++) {
        y[i+1] = (c1*x[i] + c2*x[i+1] + c3*y[i]) >> 15;
    }
}




7.7.2 Translating C Code to Linear Assembly (Inner Loop) 


Example 7-42 shows the 'C6x instructions that execute the inner loop of the
IIR filter C code. In this example:


- xptr is not postincremented after loading xi+1, because xi of the next
  iteration is actually xi+1 of the current iteration. Thus, the pointer points to
  the same address when loading both xi+1 for one iteration and xi for the
  next iteration.

- yptr is also not postincremented after storing yi+1, because yi of the next
  iteration is yi+1 for the current iteration.


Example 7-42. Linear Assembly for IIR Inner Loop

        LDH    *xptr++,xi         ; xi+1
        MPY    c1,xi,p0           ; c1 * xi
        LDH    *xptr,xi+1         ; xi+1
        MPY    c2,xi+1,p1         ; c2 * xi+1
        ADD    p0,p1,s0           ; c1 * xi + c2 * xi+1
        LDH    *yptr++,yi         ; yi
        MPY    c3,yi,p2           ; c3 * yi
        ADD    s0,p2,s1           ; c1 * xi + c2 * xi+1 + c3 * yi
        SHR    s1,15,yi+1         ; yi+1
        STH    yi+1,*yptr         ; store yi+1
 [cntr] SUB    cntr,1,cntr        ; decrement loop counter
 [cntr] B      LOOP               ; branch to loop
7.7.3 Drawing a Dependency Graph


Figure 7-15 shows the dependency graph for the IIR filter. A loop carry path
exists from the store of yi+1 to the load of yi. The path between the STH and
the LDH is one cycle because the load and store instructions use the same
memory pipeline. Therefore, if a store is issued to a particular address on cycle
n and a load from that same address is issued on the next cycle, the load reads
the value that was written by the store instruction.


Figure 7-15. Dependency Graph of IIR Filter

[Figure not reproduced: dependency graph split into an A side and a B side,
with three LDH nodes at the top.]

Note: The shaded numbers show the loop carry path: 5 + 2 + 1 + 1 + 1 = 10.


7.7.4 Determining the Minimum Iteration Interval 


To determine the minimum iteration interval, you must consider both resources
and data dependency constraints. Based on resources in Table 7-16, the
minimum iteration interval is 2.


Note:

There are six non-.M units available: three on the A side (.S1, .D1, .L1) and
three on the B side (.S2, .D2, .L2). Therefore, to determine resource
constraints, divide the total number of non-.M units used on each side by 3
(3 is the total number of non-.M units available on each side).

Based on non-.M unit resources in Table 7-16, the minimum iteration
interval for the IIR filter is 2 because the total non-.M units on the A side is 5
(5 / 3 is greater than 1, so you round up to the next whole number). The B side
uses only three non-.M units, so this does not affect the minimum iteration
interval, and no other unit is used more than twice.


Table 7-16. Resource Table for IIR Filter

(a) A side
Unit(s)               Instructions   Total/Unit
.M1                   2 MPYs         2
.S1                   B              1
.D1                   2 LDHs         2
.L1, .S1, or .D1      ADD & SUB      2
Total non-.M units                   5

(b) B side
Unit(s)               Instructions   Total/Unit
.M2                   MPY            1
.S2                   SHR            1
.D2                   STH            1
.L2, .S2, or .D2      ADD            1
Total non-.M units                   3


However, the IIR has a data dependency constraint defined by its loop carry
path. Figure 7-15 shows that if you schedule LDH yi on cycle 0:


- The earliest you can schedule MPY p2 is on cycle 5.
- The earliest you can schedule ADD s1 is on cycle 7.
- SHR yi+1 must be on cycle 8 and STH on cycle 9.


Because the LDH must wait for the STH to be issued, the earliest the
second iteration can begin is cycle 10.


To determine the minimum loop carry path, add all of the numbers along the
loop paths in the dependency graph. This means that this loop carry path is
10 (5 + 2 + 1 + 1 + 1).
Although the minimum iteration interval is the greater of the resource limits and
data dependency constraints, an interval of 10 seems slow. Figure 7-16
shows how to improve the performance.


7.7.4.1 Drawing a New Dependency Graph


Figure 7-16 shows a new graph with a loop carry path of 4 (2 + 1 + 1). Because
the MPY p2 instruction can read yi+1 while it is still in a register, you can reduce
the loop carry path by six cycles. LDH yi is no longer in the graph. Instead, you
can issue LDH y[0] once outside the loop. In every iteration after that, the yi+1
values written by the SHR instruction are valid y inputs to the MPY instruction.
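
The relationship between the two constraints can be written out as a short calculation. This is a minimal sketch with illustrative function names: the loop carry path is the sum of the latencies along the path, and the minimum iteration interval is the greater of that sum and the resource bound.

```c
/* Sum of the latencies along one loop carry path. */
int carry_path_length(const int *latencies, int count)
{
    int total = 0;

    for (int i = 0; i < count; i++)
        total += latencies[i];
    return total;
}

/* Minimum iteration interval: the greater of the resource bound and
 * the loop carry path length. */
int min_iteration_interval(int resource_bound, int carry_path)
{
    return resource_bound > carry_path ? resource_bound : carry_path;
}
```

With a resource bound of 2, the original path (5 + 2 + 1 + 1 + 1) gives an interval of 10, and the shortened path (2 + 1 + 1) gives an interval of 4.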


Figure 7-16. Dependency Graph of IIR Filter (With Smaller Loop Carry)

[Figure not reproduced: dependency graph split into an A side and a B side.]

Note: The shaded numbers show the loop carry path: 2 + 1 + 1 = 4.


7.7.4.2 New 'C6x Instructions (Inner Loop)


Example 7-43 shows the new linear assembly from the graph in Figure 7-16, 
where LDH yi was removed. The one variable y that is read and written is yi 
for the MPY p2 instruction and yi+1 for the SHR and STH instructions. 
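
The same idea can be shown at the C level: load y[0] once, then carry the newest output in a local variable instead of reading it back from memory. This is a minimal sketch; the function names and test data are assumptions made for illustration, x and y must each hold 101 elements, and the inputs are kept small so the intermediate sum stays within the range of int.

```c
#include <string.h>

#define N 100

/* Reads y[i] back from memory in every iteration. */
void iir_reload(const short x[], short y[], short c1, short c2, short c3)
{
    for (int i = 0; i < N; i++)
        y[i + 1] = (short)((c1 * x[i] + c2 * x[i + 1] + c3 * y[i]) >> 15);
}

/* Loads y[0] once and keeps the most recent output in a local variable. */
void iir_carried(const short x[], short y[], short c1, short c2, short c3)
{
    short yi = y[0];

    for (int i = 0; i < N; i++) {
        yi = (short)((c1 * x[i] + c2 * x[i + 1] + c3 * yi) >> 15);
        y[i + 1] = yi;
    }
}

/* Returns 1 when both versions produce the same outputs for a small
 * (hypothetical) test pattern, 0 otherwise. */
int iir_versions_agree(void)
{
    short x[N + 1], ya[N + 1], yb[N + 1];

    for (int i = 0; i <= N; i++)
        x[i] = (short)((i * 13) % 200 - 100);
    memset(ya, 0, sizeof ya);
    memset(yb, 0, sizeof yb);
    ya[0] = yb[0] = 50;

    iir_reload(x, ya, 8192, 4096, 16384);
    iir_carried(x, yb, 8192, 4096, 16384);
    return memcmp(ya, yb, sizeof ya) == 0;
}
```

Both functions compute the same recurrence; only the source of the previous output differs, which is what shortens the loop carry path in the assembly version.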


Example 7-43. Linear Assembly for IIR Inner Loop With Reduced Loop Carry Path

        LDH    *xptr++,xi         ; xi+1
        MPY    c1,xi,p0           ; c1 * xi
        LDH    *xptr,xi+1         ; xi+1
        MPY    c2,xi+1,p1         ; c2 * xi+1
        ADD    p0,p1,s0           ; c1 * xi + c2 * xi+1
        MPY    c3,y,p2            ; c3 * yi
        ADD    s0,p2,s1           ; c1 * xi + c2 * xi+1 + c3 * yi
        SHR    s1,15,y            ; yi+1
        STH    y,*yptr++          ; store yi+1
 [cntr] SUB    cntr,1,cntr        ; decrement loop counter
 [cntr] B      LOOP               ; branch to loop


7.7.5 Linear Assembly Resource Allocation 


Example 7-44 shows the same linear assembly instructions as those in
Example 7-43 with the functional units and registers assigned.


Example 7-44. Linear Assembly for IIR Inner Loop (With Allocated Resources)

        LDH    .D1   *A4++,A2     ; xi+1
        MPY    .M1   A6,A2,A5     ; c1 * xi
        LDH    .D1   *A4,A3       ; xi+1
        MPY    .M1X  B6,A3,A7     ; c2 * xi+1
        ADD    .L1   A5,A7,A9     ; c1 * xi + c2 * xi+1
        MPY    .M2X  A8,B2,B3     ; c3 * yi
        ADD    .L2X  B3,A9,B5     ; c1 * xi + c2 * xi+1 + c3 * yi
        SHR    .S2   B5,15,B2     ; yi+1
        STH    .D2   B2,*B4++     ; store yi+1
   [A1] SUB    .L1   A1,1,A1      ; decrement loop counter
   [A1] B      .S1   LOOP         ; branch to loop
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7.7.6 Modulo Iteration Interval Scheduling 


Table 7-17 shows the modulo iteration interval table for the IIR filter. The SHR 
instruction on cycle 10 finishes in time for the MPY p2 instruction from the next 
iteration to read its result on cycle 11. 


Table 7-17. Modulo Iteration Interval Table for IIR (4-Cycle Loop) 


Unit/Cycle   0             4             8, 12, 16, ...
.D1          LDH xi        * LDH xi      ** LDH xi
.D2
.M1
.M2
.L1                                      ADD s0
.L2
.S1
.S2
1X
2X

Unit/Cycle   1             5             9, 13, 17, ...
.D1          LDH xi+1      * LDH xi+1    ** LDH xi+1
.D2
.M1                        MPY p0        * MPY p0
.M2
.L1                        SUB cntr      * SUB cntr
.L2                                      ADD s1
.S1
.S2
1X
2X                                       ADD s1

Unit/Cycle   2             6             10, 14, 18, ...
.D1
.D2
.M1                        MPY p1        * MPY p1
.M2
.L1
.L2
.S1                        B LOOP        * B LOOP
.S2                                      SHR yi+1
1X                         MPY p1        * MPY p1
2X

Unit/Cycle   3             7             11, 15, 19, ...
.D1
.D2                                      STH yi+1
.M1
.M2                        MPY p2        * MPY p2
.L1
.L2
.S1
.S2
1X
2X                         MPY p2        * MPY p2

Note: The asterisks indicate the iteration of the loop.


7.7.7 Using the Assembly Optimizer for the IIR Filter


Example 7-45 shows the linear assembly code to perform the IIR filter. Once
again, you can use this code as input to the assembly optimizer to create a
software-pipelined loop instead of scheduling this by hand.


Example 7-45. Linear Assembly for IIR Filter 


        .global _iir

_iir:   .cproc  x, y, c1, c2, c3
        .reg    xi, xi1, yi1
        .reg    p0, p1, p2, s0, s1, cntr
        MVK     100,cntr                ; cntr = 100
        LDH     .D2     *y++,yi1        ; yi+1
LOOP:   .trip   100
        LDH     .D1     *x++,xi         ; xi
        MPY     .M1     c1,xi,p0        ; c1 * xi
        LDH     .D1     *x,xi1          ; xi+1
        MPY     .M1X    c2,xi1,p1       ; c2 * xi+1
        ADD     .L1     p0,p1,s0        ; c1 * xi + c2 * xi+1
        MPY     .M2X    c3,yi1,p2       ; c3 * yi
        ADD     .L2X    s0,p2,s1        ; c1 * xi + c2 * xi+1 + c3 * yi
        SHR     .S2     s1,15,yi1       ; yi+1
        STH     .D2     yi1,*y++        ; store yi+1
[cntr]  SUB     .L1     cntr,1,cntr     ; decrement loop counter
[cntr]  B       .S1     LOOP            ; branch to loop
        .endproc


7.7.8 Final Assembly 


Example 7-46 shows the final assembly for the IIR filter. With one load of y[0]
outside the loop, no other loads from the y array are needed. Example 7-46
requires 408 cycles: (4 x 100) + 8.


Example 7-46. Assembly Code for IIR Filter 


        LDH     .D1     *A4++,A2        ; xi
        LDH     .D1     *A4,A3          ; xi+1
        LDH     .D2     *B4++,B2        ; load y[0] outside of loop
        MVK     .S1     100,A1          ; set up loop counter
        LDH     .D1     *A4++,A2        ;* xi
   [A1] SUB     .L1     A1,1,A1         ; decrement loop counter
||      MPY     .M1     A6,A2,A5        ; c1 * xi
||      LDH     .D1     *A4,A3          ;* xi+1
        MPY     .M1X    B6,A3,A7        ; c2 * xi+1
|| [A1] B       .S1     LOOP            ; branch to loop
        MPY     .M2X    A8,B2,B3        ; c3 * yi
LOOP:
        ADD     .L1     A5,A7,A9        ; c1 * xi + c2 * xi+1
||      LDH     .D1     *A4++,A2        ;** xi
        ADD     .L2X    B3,A9,B5        ; c1 * xi + c2 * xi+1 + c3 * yi
|| [A1] SUB     .L1     A1,1,A1         ;* decrement loop counter
||      MPY     .M1     A6,A2,A5        ;* c1 * xi
||      LDH     .D1     *A4,A3          ;** xi+1
        SHR     .S2     B5,15,B2        ; yi+1
||      MPY     .M1X    B6,A3,A7        ;* c2 * xi+1
|| [A1] B       .S1     LOOP            ;* branch to loop
        STH     .D2     B2,*B4++        ; store yi+1
||      MPY     .M2X    A8,B2,B3        ;* c3 * yi
        ; Branch occurs here


7.8 If-Then-Else Statements in a Loop 


If-then-else statements in C cause certain instructions to execute when the if
condition is true and other instructions to execute when it is false. One way to
accomplish this in linear assembly code is with conditional instructions. Because
all ’C6x instructions can be conditional on one of five general-purpose
registers, conditional instructions can handle both the true and false cases of
the if-then-else C statement.


7.8.1 If-Then-Else C Code 


Example 7-47 contains a loop with an if-then-else statement. You either add 
a[i] to sum or subtract a[i] from sum. 


Example 7-47. If-Then-Else C Code 


int if_then(short a[], int codeword, int mask, short theta)
{
    int i, sum, cond;

    sum = 0;
    for (i = 0; i < 32; i++) {
        cond = codeword & mask;
        if (theta == !(!(cond)))
            sum += a[i];
        else
            sum -= a[i];
        mask = mask << 1;
    }
    return (sum);
}


Branching is one way to execute the if-then-else statement: branch to the ADD 
when the if statement is true and branch to the SUB when the if statement is 
false. However, because each branch has five delay slots, this method 
requires additional cycles. Furthermore, branching within the loop makes soft- 
ware pipelining almost impossible. 


Using conditional instructions, on the other hand, eliminates the need to 
branch to the appropriate piece of code after checking whether the condition 
is true or false. Simply program both the ADD and SUB as usual, but make 
them conditional on the zero and nonzero values of a condition register. This 
method also allows you to software pipeline the loop and achieve much better 
performance than you would with branching. 


7.8.2 Translating C Code to Linear Assembly


Example 7-48 shows the linear assembly instructions needed to execute the
inner loop of the C code in Example 7-47.


Example 7-48. Linear Assembly for If-Then-Else Inner Loop 


        AND     codeword,mask,cond      ; cond = codeword & mask
[cond]  MVK     1,cond                  ; !(!(cond))
        CMPEQ   theta,cond,if           ; (theta == !(!(cond)))
        LDH     *aptr++,ai              ; a[i]
[if]    ADD     sum,ai,sum              ; sum += a[i]
[!if]   SUB     sum,ai,sum              ; sum -= a[i]
        SHL     mask,1,mask             ; mask = mask << 1;
[cntr]  ADD     -1,cntr,cntr            ; decrement counter
[cntr]  B       LOOP                    ; for LOOP


CMPEQ is used to create IF. The ADD is conditional when IF is nonzero (corre- 
sponds to then); the SUB is conditional when IF is 0 (corresponds to else). 


A conditional MVK performs the !(!(cond)) C statement. If the result of the 
bitwise AND is nonzero, a 1 is written into cond; if the result of the AND is 0, 
cond remains at 0. 
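
The predicated sequence described above can be modelled in C to check that it matches the branching form. This is a sketch under assumptions: the function names are hypothetical, and the mask and codeword are declared unsigned here (unlike Example 7-47) so that shifting the mask 32 times stays well defined in portable C.

```c
#include <assert.h>

/* Branching form, following the structure of the C code in Example 7-47. */
int if_then_branch(const short a[], unsigned codeword, unsigned mask, short theta)
{
    assert(a != 0);
    int sum = 0;
    for (int i = 0; i < 32; i++) {
        unsigned cond = codeword & mask;
        if (theta == !(!cond))
            sum += a[i];
        else
            sum -= a[i];
        mask <<= 1;
    }
    return sum;
}

/* Predicated form: every statement runs on every iteration, and the
   ADD and SUB are each guarded by the same condition value. */
int if_then_predicated(const short a[], unsigned codeword, unsigned mask, short theta)
{
    assert(a != 0);
    int sum = 0;
    for (int i = 0; i < 32; i++) {
        unsigned cond = codeword & mask;      /* AND                */
        if (cond) cond = 1;                   /* [cond] MVK 1,cond  */
        int pred = (theta == (int)cond);      /* CMPEQ              */
        int ai = a[i];                        /* LDH                */
        sum += pred ? ai : 0;                 /* [if]  ADD          */
        sum -= pred ? 0 : ai;                 /* [!if] SUB          */
        mask <<= 1;                           /* SHL                */
    }
    return sum;
}
```

Because exactly one of the two guarded statements changes sum in any iteration, the predicated loop has no control flow inside its body, which is the property that makes it a candidate for software pipelining.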


7.8.3 Drawing a Dependency Graph 


Figure 7-17 shows the dependency graph for the if-then-else C code. This 
graph illustrates the following arrangement: 


- Two nodes on the graph contain sum: one for the ADD and one for the
  SUB. Because some iterations are performing an ADD and others are
  performing a SUB, each of these nodes is a possible input to the next
  iteration of either node.

- The LDH ai instruction is a parent of both ADD sum and SUB sum,
  because both instructions read ai.

- CMPEQ if is also a parent to ADD sum and SUB sum, because both read
  IF for the conditional execution.

- The result of SHL mask is read on the next iteration by the AND cond
  instruction.


Figure 7-17. Dependency Graph of If-Then-Else Code 


(Figure not reproduced. The graph is split into an A side and a B side.)


7.8.4 Determining the Minimum Iteration Interval


With nine instructions, the minimum iteration interval is at least 2, because a 
maximum of eight instructions can be in parallel. Based on the way the depen- 
dency graph in Figure 7—17 is split, five instructions are on the A side and four 
are on the B side. Because none of the instructions are MPYs, all instructions 
must go on the .S, .D, or .L units, which means you have a total of six 
resources. 


- LDH must be on a .D unit.
- SHL, B, and MVK must be on a .S unit.
- The ADDs and SUB can be on the .S, .L, or .D units.
- The AND can be on a .S or .L unit.


From Table 7-18, you can see that no one resource is used more than two 
times, so the minimum iteration interval is still 2. 


Table 7-18. Resource Table for If-Then-Else Code 


(a) A side

Unit(s)               Instructions    Total/Unit
.M1                                   0
.S1                   SHL & B         2
.D1                   LDH             1
.L1, .S1, or .D1      ADD & SUB       2
Total non-.M units                    5

(b) B side

Unit(s)               Instructions    Total/Unit
.M2                                   0
.S2                   MVK             1
.L2                   CMPEQ           1
.L2 or .S2            AND             1
.L2, .S2, or .D2      ADD             1
Total non-.M units                    4


The minimum iteration interval is also affected by the total number of
instructions. Because three units can perform nonmultiply operations on a given
side, a total of six instructions (3 units x 2 cycles) can be performed on each
side with a minimum iteration interval of 2. Because only five instructions are
on the A side and four are on the B side, the minimum iteration interval is still 2.
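
The resource check above amounts to taking the largest of a few ceilings. The following C sketch illustrates that arithmetic; the function name and parameters are hypothetical, and the constant 3 reflects the three non-.M units (.L, .S, .D) per side mentioned in the text.

```c
#include <assert.h>

/* Ceiling of a / b for non-negative a and positive b. */
static int ceil_div(int a, int b)
{
    assert(a >= 0 && b > 0);
    return (a + b - 1) / b;
}

static int max_int(int a, int b)
{
    return a > b ? a : b;
}

/* Resource-based lower bound on the iteration interval.
   max_per_unit : largest Total/Unit entry for any single fixed unit
   non_m_a      : non-.M instructions assigned to the A side
   non_m_b      : non-.M instructions assigned to the B side          */
int resource_min_ii(int max_per_unit, int non_m_a, int non_m_b)
{
    int ii = max_per_unit;
    ii = max_int(ii, ceil_div(non_m_a, 3));   /* 3 non-.M units per side */
    ii = max_int(ii, ceil_div(non_m_b, 3));
    return ii;
}
```

For the if-then-else loop, resource_min_ii(2, 5, 4) gives 2. Rows that list several alternative units count toward the side totals rather than toward max_per_unit, since the instruction can be moved to whichever unit is free.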


7.8.5 Linear Assembly Resource Allocation


Now that the graph is split and you know the minimum iteration interval, you 
can allocate functional units and registers to the instructions. You must ensure 
that no resource is used more than twice. 


Example 7-49 shows the linear assembly with the functional units and
registers that are used in the inner loop.


Example 7-49. Linear Assembly for Full If-Then-Else Code 


        .global _if_then

_if_then: .cproc a, cword, mask, theta
        .reg    cond, if, ai, sum, cntr
        MVK     32,cntr                 ; cntr = 32
        ZERO    sum                     ; sum = 0
LOOP:   .trip   32
        AND     .S2X    cword,mask,cond ; cond = codeword & mask
[cond]  MVK     .S2     1,cond          ; !(!(cond))
        CMPEQ   .L2     theta,cond,if   ; (theta == !(!(cond)))
        LDH     .D1     *a++,ai         ; a[i]
[if]    ADD     .L1     sum,ai,sum      ; sum += a[i]
[!if]   SUB     .D1     sum,ai,sum      ; sum -= a[i]
        SHL     .S1     mask,1,mask     ; mask = mask << 1;
[cntr]  ADD     .L2     -1,cntr,cntr    ; decrement counter
[cntr]  B       .S1     LOOP            ; for LOOP
        .return sum
        .endproc


7.8.6 Final Assembly


Example 7-50 shows the final assembly code after software pipelining. The
performance of this loop is 70 cycles (2 x 32 + 6).


Example 7-50. Assembly Code for If-Then-Else 


        MVK     .S2     32,B0           ; set up loop counter
   [B0] ADD     .L2     -1,B0,B0        ; decrement counter
   [B0] ADD     .L2     -1,B0,B0        ; decrement counter
|| [B0] B       .S1     LOOP            ; for LOOP
||      LDH     .D1     *A4++,A5        ; a[i]
        SHL     .S1     A6,1,A6         ; mask = mask << 1;
||      AND     .S2X    B4,A6,B2        ; cond = codeword & mask
   [B2] MVK     .S2     1,B2            ; !(!(cond))
|| [B0] ADD     .L2     -1,B0,B0        ; decrement counter
|| [B0] B       .S1     LOOP            ;* for LOOP
||      LDH     .D1     *A4++,A5        ;* a[i]
        CMPEQ   .L2     B6,B2,B1        ; (theta == !(!(cond)))
||      SHL     .S1     A6,1,A6         ;* mask = mask << 1;
||      AND     .S2X    B4,A6,B2        ;* cond = codeword & mask
||      ZERO    .L1     A7              ; zero out accumulator
LOOP:
   [B0] ADD     .L2     -1,B0,B0        ; decrement counter
|| [B2] MVK     .S2     1,B2            ;* !(!(cond))
|| [B0] B       .S1     LOOP            ;** for LOOP
||      LDH     .D1     *A4++,A5        ;** a[i]
   [B1] ADD     .L1     A7,A5,A7        ; sum += a[i]
||[!B1] SUB     .D1     A7,A5,A7        ; sum -= a[i]
||      CMPEQ   .L2     B6,B2,B1        ;* (theta == !(!(cond)))
||      SHL     .S1     A6,1,A6         ;** mask = mask << 1;
||      AND     .S2X    B4,A6,B2        ;** cond = codeword & mask
        ; Branch occurs here


You can improve the performance of the code in Example 7-50 if you know
that the loop count is at least 3. If the loop count is at least 3, remove the decre- 
ment counter instructions outside the loop and put the MVK (for setting up the 
loop counter) in parallel with the first branch. These two changes save two 
cycles at the beginning of the loop prolog. 


The first two branches are now unconditional, because the loop count is at 
least 3 and you know that the first two branches must execute. To account for 
the removal of the three decrement-loop-counter instructions, set the loop 
counter to 3 fewer than the actual number of times you want the loop to 
execute: in this case, 29 (32 — 3). 


Example 7-51. Assembly Code for If-Then-Else With Loop Count Greater Than 3 


        B       .S1     LOOP            ; for LOOP
||      LDH     .D1     *A4++,A5        ; a[i]
||      MVK     .S2     29,B0           ; set up loop counter
        SHL     .S1     A6,1,A6         ; mask = mask << 1;
||      AND     .S2X    B4,A6,B2        ; cond = codeword & mask
   [B2] MVK     .S2     1,B2            ; !(!(cond))
||      B       .S1     LOOP            ;* for LOOP
||      LDH     .D1     *A4++,A5        ;* a[i]
        CMPEQ   .L2     B6,B2,B1        ; (theta == !(!(cond)))
||      SHL     .S1     A6,1,A6         ;* mask = mask << 1;
||      AND     .S2X    B4,A6,B2        ;* cond = codeword & mask
||      ZERO    .L1     A7              ; zero out accumulator
LOOP:
   [B0] ADD     .L2     -1,B0,B0        ; decrement counter
|| [B2] MVK     .S2     1,B2            ;* !(!(cond))
|| [B0] B       .S1     LOOP            ;** for LOOP
||      LDH     .D1     *A4++,A5        ;** a[i]
   [B1] ADD     .L1     A7,A5,A7        ; sum += a[i]
||[!B1] SUB     .D1     A7,A5,A7        ; sum -= a[i]
||      CMPEQ   .L2     B6,B2,B1        ;* (theta == !(!(cond)))
||      SHL     .S1     A6,1,A6         ;** mask = mask << 1;
||      AND     .S2X    B4,A6,B2        ;** cond = codeword & mask
        ; Branch occurs here


Example 7-51 shows the improved loop with a cycle count of 68 (2 x 32 + 4).
Table 7-19 compares the performance of Example 7-50 and Example 7-51.


Table 7-19. Comparison of If-Then-Else Code Examples


Code Example                                                             Cycles          Cycle Count
Example 7-50  If-then-else assembly code                                 (2 x 32) + 6    70
Example 7-51  If-then-else assembly code with loop count greater than 3  (2 x 32) + 4    68


7.9 Loop Unrolling 


Even though the performance of the previous example is good, it can be
improved. When resources are not fully used, you can improve performance by
unrolling the loop. In Example 7-51, only nine instructions execute every two
cycles. If you unroll the loop and analyze the new minimum iteration interval,
you have room to add instructions. A minimum iteration interval of 3 provides
a 25% improvement in throughput: three cycles to do two iterations, rather
than the four cycles required in Example 7-51.


7.9.1 Unrolled If-Then-Else C Code


Example 7-52 shows the unrolled version of the if-then-else C code in 
Example 7-47 on page 7-87. 


Example 7-52. If-Then-Else C Code (Unrolled) 


int unrolled_if_then(short a[], int codeword, int mask, short theta)
{
    int i, sum, cond;

    sum = 0;
    for (i = 0; i < 32; i += 2) {
        cond = codeword & mask;
        if (theta == !(!(cond)))
            sum += a[i];
        else
            sum -= a[i];
        mask = mask << 1;

        cond = codeword & mask;
        if (theta == !(!(cond)))
            sum += a[i+1];
        else
            sum -= a[i+1];
        mask = mask << 1;
    }
    return (sum);
}


7.9.2 Translating C Code to Linear Assembly


Example 7-53 shows the unrolled inner loop with 16 instructions and the 
possibility of achieving a loop with a minimum iteration interval of 3. 


Example 7-53. Linear Assembly for Unrolled If-Then-Else Inner Loop 


          AND     codeword,maski,condi      ; condi = codeword & maski
[condi]   MVK     1,condi                   ; !(!(condi))
          CMPEQ   theta,condi,ifi           ; (theta == !(!(condi)))
          LDH     *aptr++,ai                ; a[i]
[ifi]     ADD     sumi,ai,sumi              ; sum += a[i]
[!ifi]    SUB     sumi,ai,sumi              ; sum -= a[i]
          SHL     maski,1,maski+1           ; maski+1 = maski << 1;
          AND     codeword,maski+1,condi+1  ; condi+1 = codeword & maski+1
[condi+1] MVK     1,condi+1                 ; !(!(condi+1))
          CMPEQ   theta,condi+1,ifi+1       ; (theta == !(!(condi+1)))
          LDH     *aptr++,ai+1              ; a[i+1]
[ifi+1]   ADD     sumi+1,ai+1,sumi+1        ; sum += a[i+1]
[!ifi+1]  SUB     sumi+1,ai+1,sumi+1        ; sum -= a[i+1]
          SHL     maski+1,1,maski           ; maski = maski+1 << 1;
[cntr]    ADD     -1,cntr,cntr              ; decrement counter
[cntr]    B       LOOP                      ; for LOOP


7.9.3 Drawing a Dependency Graph


Although there are numerous ways to split the dependency graph, the main 
goal is to achieve a minimum iteration interval of 3 and meet these conditions: 


- You cannot have more than nine non-.M instructions on either side.
- Only three non-.M instructions can execute per cycle.


Figure 7-18 shows the dependency graph for the unrolled if-then-else code. 
Nine instructions are on the A side, and seven instructions are on the B side. 


Figure 7-18. Dependency Graph of If-Then-Else Code (Unrolled) 
(Figure not reproduced. The graph is split into an A side and a B side.)


7.9.4 Determining the Minimum Iteration Interval


With 16 instructions, the minimum iteration interval is at least 3 because a 
maximum of six instructions can be in parallel with the following allocation 
possibilities: 


- LDH must be on a .D unit.
- SHL, B, and MVK must be on a .S unit.
- The ADDs and SUB can be on a .S, .L, or .D unit.
- The AND can be on a .S or .L unit.


From Table 7-20, you can see that no one resource is used more than three
times, so the minimum iteration interval is still 3.


Checking the total number of non-.M instructions on each side shows that a
total of nine instructions can be performed with the minimum iteration interval
of 3. Because only seven non-.M instructions are on the B side, the minimum
iteration interval is still 3.


Table 7-20. Resource Table for Unrolled If-Then-Else Code 


(a) A side

Unit(s)               Instructions        Total/Unit
.M1                                       0
.S1                   MVK and 2 SHLs      3
.D1                   2 LDHs              2
.L1                   CMPEQ               1
.L1 or .S1            AND                 1
.L1, .S1, or .D1      ADD and SUB         2
Total non-.M units                        9

(b) B side

Unit(s)               Instructions        Total/Unit
.M2                                       0
.S2                   MVK and B           2
.L2                   CMPEQ               1
.L2 or .S2            AND                 1
.L2, .S2, or .D2      SUB and 2 ADDs      3
Total non-.M units                        7


7.9.5 Linear Assembly Resource Allocation 


Now that the graph is split and you know the minimum iteration interval, you 
can allocate functional units and registers to the instructions. You must ensure 
no resource is used more than three times. 


Example 7-54 shows the linear assembly code with the functional units and 
registers. 


Example 7-54. Linear Assembly for Full Unrolled If-Then-Else Code 


        .global _unrolled_if_then

_unrolled_if_then: .cproc a, cword, mask, theta
        .reg    cword, mask, theta, ifi, ifi1, a, ai, ai1, cntr
        .reg    cdi, cdi1, sumi, sumi1, sum
        MV      A4,a                    ; C callable register for 1st operand
        MV      B4,cword                ; C callable register for 2nd operand
        MV      A6,mask                 ; C callable register for 3rd operand
        MV      B6,theta                ; C callable register for 4th operand
        MVK     16,cntr                 ; cntr = 32/2
        ZERO    sumi                    ; sumi = 0
        ZERO    sumi1                   ; sumi+1 = 0
LOOP:   .trip   32
        AND     .L1X    cword,mask,cdi  ; cdi = codeword & maski
[cdi]   MVK     .S1     1,cdi           ; !(!(cdi))
        CMPEQ   .L1X    theta,cdi,ifi   ; (theta == !(!(cdi)))
        LDH     .D1     *a++,ai         ; a[i]
[ifi]   ADD     .L1     sumi,ai,sumi    ; sum += a[i]
[!ifi]  SUB     .D1     sumi,ai,sumi    ; sum -= a[i]
        SHL     .S1     mask,1,mask     ; maski+1 = maski << 1;
        AND     .L2X    cword,mask,cdi1 ; cdi+1 = codeword & maski+1
[cdi1]  MVK     .S2     1,cdi1          ; !(!(cdi+1))
        CMPEQ   .L2     theta,cdi1,ifi1 ; (theta == !(!(cdi+1)))
        LDH     .D1     *a++,ai1        ; a[i+1]
[ifi1]  ADD     .L2     sumi1,ai1,sumi1 ; sum += a[i+1]
[!ifi1] SUB     .D2     sumi1,ai1,sumi1 ; sum -= a[i+1]
        SHL     .S1     mask,1,mask     ; maski = maski+1 << 1;
[cntr]  ADD     .D2     -1,cntr,cntr    ; decrement counter
[cntr]  B       .S2     LOOP            ; for LOOP
        ADD     sumi,sumi1,sum          ; add sumi and sumi+1 for ret value
        .return sum
        .endproc


7.9.6 Final Assembly


Example 7-55 shows the final assembly code after software pipelining. The 
cycle count of this loop is now 53: (3 x 16) + 5. 


Example 7-55. Assembly Code for Unrolled If-Then-Else 


        MVK     .S2     16,B0           ; set up loop counter
        LDH     .D1     *A4++,A5        ; a[i]
|| [B0] ADD     .D2     -1,B0,B0        ; decrement counter
        LDH     .D1     *A4++,B5        ; a[i+1]
|| [B0] B       .S2     LOOP            ; for LOOP
|| [B0] ADD     .D2     -1,B0,B0        ; decrement counter
||      SHL     .S1     A6,1,A6         ; maski+1 = maski << 1;
||      AND     .L1X    B4,A6,A2        ; condi = codeword & maski
   [A2] MVK     .S1     1,A2            ; !(!(condi))
||      AND     .L2X    B4,A6,B2        ; condi+1 = codeword & maski+1
||      ZERO    .L1     A7              ; zero accumulator
   [B2] MVK     .S2     1,B2            ; !(!(condi+1))
||      CMPEQ   .L1X    B6,A2,A1        ; (theta == !(!(condi)))
||      SHL     .S1     A6,1,A6         ; maski = maski+1 << 1;
||      LDH     .D1     *A4++,A5        ;* a[i]
||      ZERO    .L2     B7              ; zero accumulator
LOOP:
        CMPEQ   .L2     B6,B2,B1        ; (theta == !(!(condi+1)))
|| [B0] ADD     .D2     -1,B0,B0        ; decrement counter
||      LDH     .D1     *A4++,B5        ;* a[i+1]
|| [B0] B       .S2     LOOP            ;* for LOOP
||      SHL     .S1     A6,1,A6         ;* maski+1 = maski << 1;
||      AND     .L1X    B4,A6,A2        ;* condi = codeword & maski
   [A1] ADD     .L1     A7,A5,A7        ; sum += a[i]
||[!A1] SUB     .D1     A7,A5,A7        ; sum -= a[i]
|| [A2] MVK     .S1     1,A2            ;* !(!(condi))
||      AND     .L2X    B4,A6,B2        ;* condi+1 = codeword & maski+1
   [B1] ADD     .L2     B7,B5,B7        ; sum += a[i+1]
||[!B1] SUB     .D2     B7,B5,B7        ; sum -= a[i+1]
|| [B2] MVK     .S2     1,B2            ;* !(!(condi+1))
||      CMPEQ   .L1X    B6,A2,A1        ;* (theta == !(!(condi)))
||      SHL     .S1     A6,1,A6         ;* maski = maski+1 << 1;
||      LDH     .D1     *A4++,A5        ;** a[i]
        ; Branch occurs here
        ADD     .L1X    A7,B7,A4        ; move to return register


7.9.7 Comparing Performance


Table 7-21 compares the performance of all versions of the if-then-else code 
examples. 


Table 7-21. Comparison of If-Then-Else Code Examples 


Code Example                                                             Cycles          Cycle Count
Example 7-50  If-then-else assembly code                                 (2 x 32) + 6    70
Example 7-51  If-then-else assembly code with loop count greater than 3  (2 x 32) + 4    68
Example 7-55  Unrolled if-then-else assembly code                        (3 x 16) + 5    53
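
The cycle counts in this table all follow the same pattern: iteration interval times the number of kernel iterations, plus a fixed number of cycles outside the kernel. The short C sketch below reproduces that arithmetic; the function and parameter names are hypothetical, and the per-element figure is scaled by 100 only to stay in integer arithmetic.

```c
#include <assert.h>

/* Total cycles for a software-pipelined loop:
   iteration interval x kernel iterations + cycles outside the kernel. */
int pipelined_cycles(int ii, int kernel_iterations, int overhead)
{
    assert(ii > 0 && kernel_iterations >= 0 && overhead >= 0);
    return ii * kernel_iterations + overhead;
}

/* Kernel cycles spent per original loop iteration, scaled by 100.
   unroll_factor is the number of original iterations per kernel pass. */
int centicycles_per_iteration(int ii, int unroll_factor)
{
    assert(ii > 0 && unroll_factor > 0);
    return (100 * ii) / unroll_factor;
}
```

With these helpers, the original loop costs 200 hundredths of a cycle per element and the unrolled loop 150, which is the 25% throughput improvement quoted for unrolling.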


7.10 Live-Too-Long Issues


When the result of a parent instruction is live longer than the minimum iteration 
interval of a loop, you have a live-too-long problem. Because each instruction 
executes every iteration interval cycle, the next iteration of that parent over- 
writes the register with a new value before the child can read it. Section 7.6.6.1, 
Resource Conflicts, on page 7-66 showed how to solve this problem simply 
by moving the parent to a later cycle. This is not always a valid solution. 


7.10.1 C Code With Live-Too-Long Problem 


Example 7-56 shows C code with a live-too-long problem that cannot be 
solved by rescheduling the parent instruction. Although it is not obvious from 
the C code, the dependency graph in Figure 7-19 on page 7-104 shows a split- 
join path that causes this live-too-long problem. 


Example 7-56. Live-Too-Long C Code 


int live_long(short a[], short b[], short c, short d, short e)
{
    int i, sum0, sum1, sum, a0, a2, a3, b0, b2, b3;
    short a1, b1;

    sum0 = 0;
    sum1 = 0;
    for (i = 0; i < 100; i++) {
        a0 = a[i] * c;
        a1 = a0 >> 15;
        a2 = a1 * d;
        a3 = a2 + a0;
        sum0 += a3;
        b0 = b[i] * c;
        b1 = b0 >> 15;
        b2 = b1 * e;
        b3 = b2 + b0;
        sum1 += b3;
    }
    sum = sum0 + sum1;
    return (sum);
}


7.10.2 Translating C Code to Linear Assembly


Example 7-57 shows the assembly instructions that execute the inner loop in 
Example 7-56. 


Example 7-57. Linear Assembly for Live-Too-Long Inner Loop 


        LDH     *aptr++,ai      ; load ai from memory
        LDH     *bptr++,bi      ; load bi from memory
        MPY     ai,c,a0         ; a0 = ai * c
        SHR     a0,15,a1        ; a1 = a0 >> 15
        MPY     a1,d,a2         ; a2 = a1 * d
        ADD     a2,a0,a3        ; a3 = a2 + a0
        ADD     sum0,a3,sum0    ; sum0 += a3
        MPY     bi,c,b0         ; b0 = bi * c
        SHR     b0,15,b1        ; b1 = b0 >> 15
        MPY     b1,e,b2         ; b2 = b1 * e
        ADD     b2,b0,b3        ; b3 = b2 + b0
        ADD     sum1,b3,sum1    ; sum1 += b3
[cntr]  SUB     cntr,1,cntr     ; decrement loop counter
[cntr]  B       LOOP            ; branch to loop


7.10.3 Drawing a Dependency Graph 


Figure 7-19 shows the dependency graph for the live-too-long code. This 
algorithm includes three separate and independent graphs. Two of the inde- 
pendent graphs have split-join paths: from a0 to a3 and from b0 to b3. 


Figure 7-19. Dependency Graph of Live-Too-Long Code


7.10.4 Determining the Minimum Iteration Interval


Table 7-22 shows the functional unit resources for the loop. Based on the
resource usage, the minimum iteration interval is 2 for the following reasons:


- No specific resource is used more than twice, implying a minimum
  iteration interval of 2.

- A total of five non-.M units on each side also implies a minimum iteration
  interval of 2, because three non-.M units can be used on a side during each
  cycle.


Table 7-22. Resource Table for Live-Too-Long Code 


(a) A side

Unit(s)               Instructions        Total/Unit
.M1                   MPY                 1
.S1                   B and SHR           2
.D1                   LDH                 1
.L1, .S1, or .D1      2 ADDs              2
Total non-.M units                        5

(b) B side

Unit(s)               Instructions        Total/Unit
.M2                   MPY                 1
.S2                   SHR                 1
.D2                   LDH                 1
.L2, .S2, or .D2      2 ADDs and SUB      3
Total non-.M units                        5


However, the minimum iteration interval is determined by both resources and 
data dependency. A loop carry path determined the minimum iteration interval 
of the IIR filter in section 7.7, Loop Carry Paths, on page 7-78. In this example, 
a live-too-long problem determines the minimum iteration interval. 


7.10.4.1 Split-Join-Path Problems 


In Figure 7—19, the two split-join paths from a0 to a3 and from b0 to b3 create 
the live-too-long problem. Because the ADD a3 instruction cannot be sched- 
uled until the SHR a1 and MPY a2 instructions finish, a0 must be live for at least 
four cycles. For example: 


- If MPY a0 is scheduled on cycle 5, then the earliest SHR a1 can be
  scheduled is cycle 7.

- The earliest MPY a2 can be scheduled is cycle 8.

- The earliest ADD a3 can be scheduled is cycle 10.


Because a0 is written at the end of cycle 6, it must be live from cycle 7 to
cycle 10, or four cycles. No value can be live longer than the minimum iteration 
interval, because the next iteration of the loop will overwrite that value before 
the current iteration can read the value. Therefore, if the value has to be live 
for four cycles, the minimum iteration interval must be at least 4. A minimum 
iteration interval of 4 means that the loop executes at half the performance that 
it could based on available resources. 


7.10.4.2 Unrolling the Loop 


One way to solve this problem is to unroll the loop, so that you are doing twice 
as much work in each iteration. After unrolling, the minimum iteration interval 
is 4, based on both the resources and the data dependencies of the split-join 
path. Although unrolling the loop allows you to achieve the highest possible 
loop throughput, unrolling the loop does increase the code size. 


7.10.4.3 Inserting Moves 


Another solution to the live-too-long problem is to break up the lifetime of a0 
and b0 by inserting move (MV) instructions. The MV instruction breaks up the 
left path of the split-join path into two smaller pieces. 


7.10.4.4 Drawing a New Dependency Graph


Figure 7-20 shows the new dependency graph with the MV instructions. Now
the left paths of the split-join paths are broken into two pieces. Each value, a0 
and a0’, can be live for minimum iteration interval cycles. If MPY a0 is sched- 
uled on cycle 5 and ADD a3 is scheduled on cycle 10, you can achieve a mini- 
mum iteration interval of 2 by scheduling MV a0’ on cycle 8. Then a0 is live on 
cycles 7 and 8, and a0’ is live on cycles 9 and 10. Because no values are live 
more than two cycles, the minimum iteration interval for this graph is 2. 
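
The lifetime arithmetic used in this section can be written down as a small C helper. This is a sketch with hypothetical names; it assumes, as the text does for MPY a0, that a value issued on cycle n with d delay slots is written at the end of cycle n + d and is therefore first live on cycle n + d + 1.

```c
#include <assert.h>

/* Number of cycles a value occupies its register.
   issue_cycle : cycle on which the producing instruction is issued
   delay_slots : delay slots of the producer (1 for MPY, 0 for MV)
   last_read   : cycle on which the last consumer reads the value   */
int live_cycles(int issue_cycle, int delay_slots, int last_read)
{
    int first_live = issue_cycle + delay_slots + 1;
    assert(last_read >= first_live);
    return last_read - first_live + 1;
}

/* A value fits a software-pipelined loop only if it is not live for
   more cycles than the iteration interval. */
int fits_iteration_interval(int live, int ii)
{
    return live <= ii;
}
```

Without the move, a0 is produced by MPY on cycle 5 and last read on cycle 10, so live_cycles(5, 1, 10) is 4. With MV a0' on cycle 8, a0 is live for live_cycles(5, 1, 8) = 2 cycles and a0' for live_cycles(8, 0, 10) = 2 cycles, so both fit an iteration interval of 2.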


Figure 7-20. Dependency Graph of Live-Too-Long Code (Split-Join Path Resolved)


7.10.5 Linear Assembly Resource Allocation


Example 7-58 shows the linear assembly code with the functional units as- 
signed. The choice of units for the ADDs and SUB is flexible and represents 
one of a number of possibilities. One goal is to ensure that no functional unit 
is used more than the minimum iteration interval, or two times. 


The two 2X paths and one 1X path are required because the values c, d, and 
e reside on the side opposite from the instruction that is reading them. If these 
values had created a bottleneck of resources and caused the minimum itera- 
tion interval to increase, c, d, and e could have been loaded into the opposite 
register file outside the loop to eliminate the cross path. 


Example 7-58. Linear Assembly for Full Live-Too-Long Code


        .global _live_long

_live_long: .cproc a, b, c, d, e
        .reg    ai, bi, sum0, sum1, sum
        .reg    a0p, a_0, a_1, a_2, a_3, b_0, b0p, b_1, b_2, b_3, cntr
        MVK     100,cntr                ; cntr = 100
        ZERO    sum0                    ; sum0 = 0
        ZERO    sum1                    ; sum1 = 0
LOOP:   .trip   100
        LDH     .D1     *a++,ai         ; load ai from memory
        LDH     .D2     *b++,bi         ; load bi from memory
        MPY     .M1     ai,c,a_0        ; a0 = ai * c
        SHR     .S1     a_0,15,a_1      ; a1 = a0 >> 15
        MPY     .M1X    a_1,d,a_2       ; a2 = a1 * d
        MV      .D1     a_0,a0p         ; save a0 across iterations
        ADD     .L1     a_2,a0p,a_3     ; a3 = a2 + a0
        ADD     .L1     sum0,a_3,sum0   ; sum0 += a3
        MPY     .M2X    bi,c,b_0        ; b0 = bi * c
        SHR     .S2     b_0,15,b_1      ; b1 = b0 >> 15
        MPY     .M2X    b_1,e,b_2       ; b2 = b1 * e
        MV      .D2     b_0,b0p         ; save b0 across iterations
        ADD     .L2     b_2,b0p,b_3     ; b3 = b2 + b0
        ADD     .L2     sum1,b_3,sum1   ; sum1 += b3
[cntr]  SUB     .S2     cntr,1,cntr     ; decrement loop counter
[cntr]  B       .S1     LOOP            ; branch to loop
        ADD     sum0,sum1,sum           ; add sum0 and sum1 for ret value
        .return sum
        .endproc


7.10.6 Final Assembly With Move Instructions


Example 7-59 shows the final assembly code after software pipelining. The
performance of this loop is 212 cycles (2 x 100 + 11 + 1).


Example 7-59. Assembly Code for Live-Too-Long With Move Instructions 


        LDH     .D1     *A4++,A0        ; load ai from memory
||      LDH     .D2     *B4++,B0        ; load bi from memory
        MVK     .S2     100,B2          ; set up loop counter
        LDH     .D1     *A4++,A0        ;* load ai from memory
||      LDH     .D2     *B4++,B0        ;* load bi from memory
        ZERO    .L1     A1              ; zero out accumulator
||      ZERO    .L2     B1              ; zero out accumulator
        LDH     .D1     *A4++,A0        ;** load ai from memory
||      LDH     .D2     *B4++,B0        ;** load bi from memory
   [B2] SUB     .S2     B2,1,B2         ; decrement loop counter
        MPY     .M1     A0,A6,A3        ; a0 = ai * c
||      MPY     .M2X    B0,A6,B10       ; b0 = bi * c
||      LDH     .D1     *A4++,A0        ;*** load ai from memory
||      LDH     .D2     *B4++,B0        ;*** load bi from memory
   [B2] SUB     .S2     B2,1,B2         ; decrement loop counter
|| [B2] B       .S1     LOOP            ; branch to loop
        SHR     .S1     A3,15,A5        ; a1 = a0 >> 15
||      SHR     .S2     B10,15,B5       ; b1 = b0 >> 15
||      MPY     .M1     A0,A6,A3        ;* a0 = ai * c
||      MPY     .M2X    B0,A6,B10       ;* b0 = bi * c
||      LDH     .D1     *A4++,A0        ;**** load ai from memory
||      LDH     .D2     *B4++,B0        ;**** load bi from memory
        MPY     .M1X    A5,B6,A7        ; a2 = a1 * d
||      MV      .D1     A3,A2           ; save a0 across iterations
||      MPY     .M2X    B5,A8,B7        ; b2 = b1 * e
||      MV      .D2     B10,B8          ; save b0 across iterations
|| [B2] SUB     .S2     B2,1,B2         ;* decrement loop counter
|| [B2] B       .S1     LOOP            ;* branch to loop
        SHR     .S1     A3,15,A5        ;* a1 = a0 >> 15
||      SHR     .S2     B10,15,B5       ;* b1 = b0 >> 15
||      MPY     .M1     A0,A6,A3        ;** a0 = ai * c
||      MPY     .M2X    B0,A6,B10       ;** b0 = bi * c
||      LDH     .D1     *A4++,A0        ;***** load ai from memory
||      LDH     .D2     *B4++,B0        ;***** load bi from memory
LOOP:
        ADD     .L1     A7,A2,A9        ; a3 = a2 + a0
||      ADD     .L2     B7,B8,B9        ; b3 = b2 + b0
||      MPY     .M1X    A5,B6,A7        ;* a2 = a1 * d
||      MV      .D1     A3,A2           ;* save a0 across iterations
||      MPY     .M2X    B5,A8,B7        ;* b2 = b1 * e
||      MV      .D2     B10,B8          ;* save b0 across iterations
|| [B2] SUB     .S2     B2,1,B2         ;** decrement loop counter
|| [B2] B       .S1     LOOP            ;** branch to loop
        ADD     .L1     A1,A9,A1        ; sum0 += a3
||      ADD     .L2     B1,B9,B1        ; sum1 += b3
||      SHR     .S1     A3,15,A5        ;** a1 = a0 >> 15
||      SHR     .S2     B10,15,B5       ;** b1 = b0 >> 15
||      MPY     .M1     A0,A6,A3        ;*** a0 = ai * c
||      MPY     .M2X    B0,A6,B10       ;*** b0 = bi * c
||      LDH     .D1     *A4++,A0        ;****** load ai from memory
||      LDH     .D2     *B4++,B0        ;****** load bi from memory
        ; Branch occurs here
        ADD     .L1X    A1,B1,A4        ; sum = sum0 + sum1


7.11 Redundant Load Elimination


Filter algorithms typically read the same value from memory multiple times and 
are, therefore, prime candidates for optimization by eliminating redundant load 
instructions. Rather than perform a load operation each time a particular value 
is read, you can keep the value in a register and read the register multiple 
times. 


7.11.1 FIR Filter C Code 


Example 7-60 shows C code for a simple FIR filter. There are two memory
reads (x[i+j] and h[i]) for each multiply. Because the ’C6x can perform only two
LDHs per cycle, it seems, at first glance, that only one multiply-accumulate per
cycle is possible.


Example 7-60. FIR Filter C Code 


void fir(short x[], short h[], short y[])
{
    int i, j, sum;

    for (j = 0; j < 100; j++) {
        sum = 0;
        for (i = 0; i < 32; i++)
            sum += x[i+j] * h[i];
        y[j] = sum >> 15;
    }
}

One way to optimize this situation is to perform LDWs instead of LDHs to read
two data values at a time. Although using LDW works for the h array, the x array
presents a different problem because the ’C6x does not allow you to load
values across a word boundary.


For example, on the first outer loop (j = 0), you can read the x-array elements
(0 and 1, 2 and 3, etc.) as long as elements 0 and 1 are aligned on a 4-byte
word boundary. However, the second outer loop (j = 1) requires reading x-array
elements 1 through 32. The LDW operation must load elements that are not
word-aligned (1 and 2, 3 and 4, etc.).
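
The alignment argument can be checked with a few lines of C. This sketch uses a hypothetical helper name and assumes 16-bit short elements with x[0] placed on a 4-byte boundary, as the paragraph above does.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 when the pair x[k], x[k+1] starts on a 4-byte boundary,
   given that x[0] itself is word aligned and each element is 2 bytes. */
int pair_is_word_aligned(size_t k)
{
    assert(sizeof(short) == 2);     /* assumption stated in the lead-in */
    return (k * sizeof(short)) % 4 == 0;
}
```

For an outer-loop index j, the inner loop reads pairs starting at elements j, j + 2, j + 4, and so on, so every pair is aligned when j is even and none is aligned when j is odd.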


7.11.1.1 Redundant Loads 


In order to achieve two multiply-accumulates per cycle, you must reduce the 
number of LDHs. Because successive outer loops read all the same h-array 
values and almost all of the same x-array values, you can eliminate the redun- 
dant loads by unrolling the inner and outer loops. 


For example, x[1] is needed for the first outer loop (x[j+1] with j = 0) and for the
second outer loop (x[j] with j = 1). You can use a single LDH instruction to load
this value.
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7.11.1.2 New FIR Filter C Code 


Example 7-61 shows that after eliminating redundant loads, there are four 
memory-read operations for every four multiply-accumulate operations. Now 
the memory accesses no longer limit the performance. 


Example 7-61. FIR Filter C Code With Redundant Load Elimination 


void fir(short x[], short h[], short y[])
{
    int i, j, sum0, sum1;
    short x0, x1, h0, h1;

    for (j = 0; j < 100; j += 2) {
        sum0 = 0;
        sum1 = 0;
        x0 = x[j];

        for (i = 0; i < 32; i += 2) {
            x1 = x[j+i+1];
            h0 = h[i];
            sum0 += x0 * h0;
            sum1 += x1 * h0;
            x0 = x[j+i+2];
            h1 = h[i+1];
            sum0 += x1 * h1;
            sum1 += x0 * h1;
        }

        y[j] = sum0 >> 15;
        y[j+1] = sum1 >> 15;
    }
}
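One way to gain confidence that removing the redundant loads does not change the results is to run the original and the unrolled filter on the same data and compare the outputs. The sketch below does this on a host compiler. The test data and the function names fir_ref, fir_rle, and fir_mismatches are assumptions made for illustration only; the filter bodies follow Examples 7-60 and 7-61.

```c
/* Straightforward FIR filter: 100 outputs, 32 taps (as in Example 7-60). */
void fir_ref(const short x[], const short h[], short y[])
{
    int i, j, sum;

    for (j = 0; j < 100; j++) {
        sum = 0;
        for (i = 0; i < 32; i++)
            sum += x[i + j] * h[i];
        y[j] = (short)(sum >> 15);
    }
}

/* Same filter with inner and outer loops unrolled by two, so that each
   x and h value is read once per pair of outputs (as in Example 7-61). */
void fir_rle(const short x[], const short h[], short y[])
{
    int i, j, sum0, sum1;
    short x0, x1, h0, h1;

    for (j = 0; j < 100; j += 2) {
        sum0 = 0;
        sum1 = 0;
        x0 = x[j];
        for (i = 0; i < 32; i += 2) {
            x1 = x[j + i + 1];
            h0 = h[i];
            sum0 += x0 * h0;
            sum1 += x1 * h0;
            x0 = x[j + i + 2];
            h1 = h[i + 1];
            sum0 += x1 * h1;
            sum1 += x0 * h1;
        }
        y[j] = (short)(sum0 >> 15);
        y[j + 1] = (short)(sum1 >> 15);
    }
}

/* Hypothetical self-check: fills x and h with arbitrary deterministic
   data and returns the number of outputs on which the versions differ. */
int fir_mismatches(void)
{
    short x[132], h[32], y_ref[100], y_rle[100];
    int k, mismatches = 0;

    for (k = 0; k < 132; k++)
        x[k] = (short)((k * 7919) % 2001 - 1000);
    for (k = 0; k < 32; k++)
        h[k] = (short)((k * 104729) % 2001 - 1000);

    fir_ref(x, h, y_ref);
    fir_rle(x, h, y_rle);
    for (k = 0; k < 100; k++)
        if (y_ref[k] != y_rle[k])
            mismatches++;
    return mismatches;
}
```

Note that the unrolled version reads x[j+i+2] on its last inner iteration, so the x array must hold at least 131 elements, the same requirement as the original loop.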
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7.11.2 Translating C Code to Linear Assembly 


Example 7-62 shows the linear assembly that performs the inner loop of the
FIR filter C code.


Element x0 is read by the MPY p00 before it is loaded by the LDH x0 instruction;
x[j] (the first x0) is loaded outside the loop, but successive even elements
are loaded inside the loop.


Example 7-62. Linear Assembly for FIR Inner Loop 


        LDH     .D2     *x_1++[2],x1    ; x1 = x[j+i+1]
        LDH     .D1     *h++[2],h0      ; h0 = h[i]
        MPY     .M1     x0,h0,p00       ; x0 * h0
        MPY     .M1X    x1,h0,p10       ; x1 * h0
        ADD     .L1     p00,sum0,sum0   ; sum0 += x0 * h0
        ADD     .L2X    p10,sum1,sum1   ; sum1 += x1 * h0

        LDH     .D1     *x++[2],x0      ; x0 = x[j+i+2]
        LDH     .D2     *h_1++[2],h1    ; h1 = h[i+1]
        MPY     .M2     x1,h1,p01       ; x1 * h1
        MPY     .M2X    x0,h1,p11       ; x0 * h1
        ADD     .L1X    p01,sum0,sum0   ; sum0 += x1 * h1
        ADD     .L2     p11,sum1,sum1   ; sum1 += x0 * h1

 [ctr]  SUB     .S2     ctr,1,ctr       ; decrement loop counter
 [ctr]  B       .S2     LOOP            ; branch to loop
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7.11.3 Drawing a Dependency Graph 


Figure 7-21 shows the dependency graph of the FIR filter with redundant load 
elimination. 


Figure 7-21. Dependency Graph of FIR Filter (With Redundant Load Elimination) 
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7.11.4 Determining the Minimum Iteration Interval 


Table 7-23 shows that the minimum iteration interval is 2. An iteration interval 
of 2 means that two multiply-accumulates are executing per cycle. 


Table 7-23. Resource Table for FIR Filter Code 


(a) A side

Unit(s)               Instructions       Total/Unit
.M1                   2 MPYs             2
.S1                                      0
.D1                   2 LDHs             2
.L1, .S1, or .D1      2 ADDs             2
Total non-.M units                       4
1X paths                                 2

(b) B side

Unit(s)               Instructions       Total/Unit
.M2                   2 MPYs             2
.S2                   B                  1
.D2                   2 LDHs             2
.L2, .S2, or .D2      2 ADDs and SUB     3
Total non-.M units                       6
2X paths                                 2
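The resource-table reasoning can be written down as a small calculation. The sketch below is an illustration only, with hypothetical function names: it computes the resource-limited lower bound on the iteration interval for one side of the data path from the counts in a table such as Table 7-23. It assumes one .M, one .S, one .D, and one .L unit and one cross path per side, and it ignores loop-carried dependencies, which can raise the bound further.

```c
static int ceil_div(int a, int b)
{
    return (a + b - 1) / b;
}

static int max_int(int a, int b)
{
    return a > b ? a : b;
}

/* Resource bound for one side (A or B).
   m_ops:        instructions that need the .M unit
   s_ops:        instructions that need the .S unit (for example, B)
   d_ops:        instructions that need the .D unit (for example, LDH)
   flexible_ops: instructions that can go on .L, .S, or .D (ADD, SUB)
   cross_paths:  uses of this side's cross path */
int side_bound(int m_ops, int s_ops, int d_ops, int flexible_ops,
               int cross_paths)
{
    int bound = m_ops;

    bound = max_int(bound, s_ops);
    bound = max_int(bound, d_ops);
    /* All non-.M instructions must fit on the three units .L, .S, .D. */
    bound = max_int(bound, ceil_div(s_ops + d_ops + flexible_ops, 3));
    bound = max_int(bound, cross_paths);
    return bound;
}

/* The loop cannot run faster than the more heavily loaded side. */
int min_iteration_interval(int a_side_bound, int b_side_bound)
{
    return max_int(a_side_bound, b_side_bound);
}
```

For the A-side counts in Table 7-23, side_bound(2, 0, 2, 2, 2) gives 2; for the B side, side_bound(2, 1, 2, 3, 2) also gives 2, so the minimum iteration interval is 2.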


7.11.5 Linear Assembly Resource Allocation 


Example 7-63 shows the linear assembly with functional units and registers 
assigned. 


Example 7-63. Linear Assembly for Full FIR Code 


        .global _fir
_fir:   .cproc  x, h, y
        .reg    x_1, h_1, sum0, sum1, ctr, octr
        .reg    p00, p01, p10, p11, x0, x1, h0, h1, rstx, rsth

        ADD     h,2,h_1         ; set up pointer to h[1]
        MVK     50,octr         ; outer loop ctr = 100/2
        MVK     64,rstx         ; used to rst x pointer each outer loop
        MVK     64,rsth         ; used to rst h pointer each outer loop

OUTLOOP:
        ADD     x,2,x_1         ; set up pointer to x[j+1]
        SUB     h_1,2,h         ; set up pointer to h[0]
        MVK     16,ctr          ; inner loop ctr = 32/2
        ZERO    sum0            ; sum0 = 0
        ZERO    sum1            ; sum1 = 0
 [octr] SUB     octr,1,octr     ; decrement outer loop counter

        LDH     .D1     *x++[2],x0      ; x0 = x[j]
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Example 7-63. Linear Assembly for Full FIR Code (Continued) 


LOOP:   .trip 16

        LDH     .D2     *x_1++[2],x1    ; x1 = x[j+i+1]
        LDH     .D1     *h++[2],h0      ; h0 = h[i]
        MPY     .M1     x0,h0,p00       ; x0 * h0
        MPY     .M1X    x1,h0,p10       ; x1 * h0
        ADD     .L1     p00,sum0,sum0   ; sum0 += x0 * h0
        ADD     .L2X    p10,sum1,sum1   ; sum1 += x1 * h0

        LDH     .D1     *x++[2],x0      ; x0 = x[j+i+2]
        LDH     .D2     *h_1++[2],h1    ; h1 = h[i+1]
        MPY     .M2     x1,h1,p01       ; x1 * h1
        MPY     .M2X    x0,h1,p11       ; x0 * h1
        ADD     .L1X    p01,sum0,sum0   ; sum0 += x1 * h1
        ADD     .L2     p11,sum1,sum1   ; sum1 += x0 * h1

 [ctr]  SUB     .S2     ctr,1,ctr       ; decrement loop counter
 [ctr]  B       .S2     LOOP            ; branch to loop

        SHR     sum0,15,sum0    ; sum0 >> 15
        SHR     sum1,15,sum1    ; sum1 >> 15
        STH     sum0,*y++       ; y[j] = sum0 >> 15
        STH     sum1,*y++       ; y[j+1] = sum1 >> 15

        SUB     x,rstx,x        ; reset x pointer to x[j]
        SUB     h_1,rsth,h_1    ; reset h pointer to h[0]
 [octr] B       OUTLOOP         ; branch to outer loop

        .endproc
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Example 7-64 shows the final assembly for the FIR filter without redundant 
load instructions. At the end of the inner loop is a branch to OUTLOOP that 
executes the next outer loop. The outer loop counter is 50 because iterations 
j and j + 1 execute each time the inner loop is run. The inner loop counter is 
16 because iterations i and i + 1 execute each inner loop iteration.


The cycle count for this nested loop is 2352 cycles: 50 (16 x 2 + 9 + 6) + 2.
Fifteen cycles are overhead for each outer loop:


- Nine cycles execute the inner loop prolog.
- Six cycles execute the branch to the outer loop.


See section 7.13, Software Pipelining the Outer Loop, on page 7-132 for
information on how to reduce this overhead.
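The cycle-count expression above generalizes to any nested loop of this shape. As a minimal sketch, with a hypothetical function name and parameter names chosen for illustration, the total can be computed as follows:

```c
/* Total cycles for a software-pipelined inner loop nested in an outer loop.
   outer_iters:         number of outer loop passes
   inner_iters:         inner loop iterations per pass
   ii:                  iteration interval of the inner loop, in cycles
   prolog_cycles:       inner loop prolog cycles per pass
   outer_branch_cycles: cycles spent branching back to the outer loop
   setup_cycles:        one-time cycles executed before the outer loop */
long nested_loop_cycles(long outer_iters, long inner_iters, long ii,
                        long prolog_cycles, long outer_branch_cycles,
                        long setup_cycles)
{
    long per_pass = inner_iters * ii + prolog_cycles + outer_branch_cycles;

    return outer_iters * per_pass + setup_cycles;
}
```

For this example, nested_loop_cycles(50, 16, 2, 9, 6, 2) evaluates to 2352. Of the 47 cycles in each outer pass, 15 are overhead, so shrinking the prolog and branch terms is where the remaining gains lie.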
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Example 7-64. Final Assembly Code for FIR Filter With Redundant Load Elimination 


        MVK     .S1     50,A2           ; set up outer loop counter

        MVK     .S1     80,A3           ; used to rst x ptr outer loop
||      MVK     .S2     82,B6           ; used to rst h ptr outer loop

OUTLOOP:
        LDH     .D1     *A4++[2],A0     ; x0 = x[j]
||      ADD     .L2X    A4,2,B5         ; set up pointer to x[j+1]
||      ADD     .D2     B4,2,B4         ; set up pointer to h[1]
||      ADD     .L1X    B4,0,A5         ; set up pointer to h[0]
||      MVK     .S2     16,B2           ; set up inner loop counter
|| [A2] SUB     .S1     A2,1,A2         ; decrement outer loop counter

        LDH     .D1     *A5++[2],A1     ; h0 = h[i]
||      LDH     .D2     *B5++[2],B1     ; x1 = x[j+i+1]
||      ZERO    .L1     A9              ; zero out sum0
||      ZERO    .L2     B9              ; zero out sum1

        LDH     .D2     *B4++[2],B0     ; h1 = h[i+1]
||      LDH     .D1     *A4++[2],A0     ; x0 = x[j+i+2]

        LDH     .D1     *A5++[2],A1     ;* h0 = h[i]
||      LDH     .D2     *B5++[2],B1     ;* x1 = x[j+i+1]

   [B2] SUB     .S2     B2,1,B2         ; decrement inner loop counter
||      LDH     .D2     *B4++[2],B0     ;* h1 = h[i+1]
||      LDH     .D1     *A4++[2],A0     ;* x0 = x[j+i+2]

   [B2] B       .S2     LOOP            ; branch to inner loop
||      LDH     .D1     *A5++[2],A1     ;** h0 = h[i]
||      LDH     .D2     *B5++[2],B1     ;** x1 = x[j+i+1]

        MPY     .M1     A0,A1,A7        ; x0 * h0
|| [B2] SUB     .S2     B2,1,B2         ;* decrement inner loop counter
||      LDH     .D2     *B4++[2],B0     ;** h1 = h[i+1]
||      LDH     .D1     *A4++[2],A0     ;** x0 = x[j+i+2]

        MPY     .M2     B1,B0,B7        ; x1 * h1
||      MPY     .M1X    B1,A1,A8        ; x1 * h0
|| [B2] B       .S2     LOOP            ;* branch to inner loop
||      LDH     .D1     *A5++[2],A1     ;*** h0 = h[i]
||      LDH     .D2     *B5++[2],B1     ;*** x1 = x[j+i+1]

        MPY     .M2X    A0,B0,B8        ; x0 * h1
||      MPY     .M1     A0,A1,A7        ;* x0 * h0
|| [B2] SUB     .S2     B2,1,B2         ;** decrement inner loop counter
||      LDH     .D2     *B4++[2],B0     ;*** h1 = h[i+1]
||      LDH     .D1     *A4++[2],A0     ;*** x0 = x[j+i+2]
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Example 7-64 Final Assembly Code for FIR Filter With Redundant Load Elimination 


(Continued)

LOOP:
        ADD     .L2X    A8,B9,B9        ; sum1 += x1 * h0
||      ADD     .L1     A7,A9,A9        ; sum0 += x0 * h0
||      MPY     .M2     B1,B0,B7        ;* x1 * h1
||      MPY     .M1X    B1,A1,A8        ;* x1 * h0
|| [B2] B       .S2     LOOP            ;** branch to inner loop
||      LDH     .D1     *A5++[2],A1     ;**** h0 = h[i]
||      LDH     .D2     *B5++[2],B1     ;**** x1 = x[j+i+1]

        ADD     .L1X    B7,A9,A9        ; sum0 += x1 * h1
||      ADD     .L2     B8,B9,B9        ; sum1 += x0 * h1
||      MPY     .M2X    A0,B0,B8        ;* x0 * h1
||      MPY     .M1     A0,A1,A7        ;** x0 * h0
|| [B2] SUB     .S2     B2,1,B2         ;*** decrement inner loop cntr
||      LDH     .D2     *B4++[2],B0     ;**** h1 = h[i+1]
||      LDH     .D1     *A4++[2],A0     ;**** x0 = x[j+i+2]
        ; inner loop branch occurs here

   [A2] B       .S1     OUTLOOP         ; branch to outer loop
||      SUB     .L1     A4,A3,A4        ; reset x pointer to x[j]
||      SUB     .L2     B4,B6,B4        ; reset h pointer to h[0]

        SHR     .S1     A9,15,A9        ; sum0 >> 15
||      SHR     .S2     B9,15,B9        ; sum1 >> 15

        STH     .D1     A9,*A6++        ; y[j] = sum0 >> 15

        STH     .D1     B9,*A6++        ; y[j+1] = sum1 >> 15

        NOP     2                       ; branch delay slots
        ; outer loop branch occurs here
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7.12 Memory Banks 


The internal memory of the ’C6x family varies from device to device. See the 
TMS320C62x/C67x Peripherals Reference Guide to determine the memory 
blocks in your particular device. This section discusses how to write code to 
avoid memory bank conflicts. 


Most ’C6x devices use an interleaved memory bank scheme, as shown in 
Figure 7-22. Each number in the boxes represents a byte address. A load byte 
(LDB) instruction from address 0 loads byte 0 in bank 0. A load halfword (LDH) 
from address 0 loads the halfword value in bytes 0 and 1, which are also in 
bank 0. An LDW from address 0 loads bytes 0 through 3 in banks 0 and 1. 


Because each bank is single-ported memory, only one access to each bank 
is allowed per cycle. Two accesses to a single bank in a given cycle result in 
a memory stall that halts all pipeline operation for one cycle, while the second 
value is read from memory. Two memory operations per cycle are allowed 
without any stall, as long as they do not access the same bank. 
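The interleaving rule can be expressed directly in code. The sketch below assumes the 4-bank, 2-bytes-per-bank layout described in this section and considers a single memory block; the names bank_of, banks_touched, and accesses_conflict are hypothetical and chosen only for illustration.

```c
#define NUM_BANKS        4
#define BANK_WIDTH_BYTES 2

/* Bank that holds a given byte address (0 through 3). */
int bank_of(unsigned long byte_addr)
{
    return (int)((byte_addr / BANK_WIDTH_BYTES) % NUM_BANKS);
}

/* Bit mask of the banks touched by an access of 'size' bytes (1 for LDB,
   2 for LDH, 4 for LDW) that starts at byte_addr. Bit n is set when
   bank n is touched. The access is assumed to be naturally aligned. */
unsigned banks_touched(unsigned long byte_addr, unsigned size)
{
    unsigned mask = 0;
    unsigned k;

    for (k = 0; k < size; k++)
        mask |= 1u << bank_of(byte_addr + k);
    return mask;
}

/* Nonzero when two accesses issued on the same cycle hit a common bank
   and would therefore cause a one-cycle memory stall. */
int accesses_conflict(unsigned long addr_a, unsigned size_a,
                      unsigned long addr_b, unsigned size_b)
{
    return (banks_touched(addr_a, size_a) &
            banks_touched(addr_b, size_b)) != 0;
}
```

For example, halfword loads from byte addresses 0 and 8 both fall in bank 0 and conflict, while halfword loads from addresses 0 and 2 fall in banks 0 and 1 and can proceed in the same cycle.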


Figure 7-22. 4-Bank Interleaved Memory 


Bank 0          Bank 1          Bank 2          Bank 3
0     1         2     3         4     5         6     7
8     9         10    11        12    13        14    15
8N    8N+1      8N+2  8N+3      8N+4  8N+5      8N+6  8N+7


For devices that have more than one memory block (see Figure 7-23), an
access to bank 0 in one block does not interfere with an access to bank 0 in
another memory block, and no pipeline stall occurs.
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Figure 7-23. 4-Bank Interleaved Memory With Two Memory Blocks 


                  Bank 0          Bank 1          Bank 2          Bank 3
Memory block 0    0     1         2     3         4     5         6     7
                  8     9         10    11        12    13        14    15

Memory block 1    8M    8M+1      8M+2  8M+3      8M+4  8M+5      8M+6  8M+7


If each array in a loop resides in a separate memory block, the 2-cycle loop 
in Example 7-61 on page 7-112 is sufficient. This section describes a solution 
when two arrays must reside in the same memory block. 
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7.12.1 FIR Filter Inner Loop 


Example 7-65 shows the inner loop from the final assembly in Example 7-64.
The LDHs from the h array are in parallel with LDHs from the x array. If x[1] is
on an even halfword (bank 0) and h[0] is on an odd halfword (bank 1),
Example 7-65 has no memory conflicts. However, if both x[1] and h[0] are on
an even halfword in memory (bank 0) and they are in the same memory block,
every cycle incurs a memory pipeline stall and the loop runs at half the speed.
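The two alignment cases can be checked with a short simulation. This is a sketch under stated assumptions: both arrays are in the same memory block, banks are 2 bytes wide and 4-way interleaved, and the loads are paired as in the 2-cycle loop (h[i] with x[j+i+1] on one cycle, h[i+1] with x[j+i+2] on the other). The function name inner_loop_stalls is hypothetical.

```c
/* Bank (0 through 3) of a byte address in a 4-bank memory whose banks are
   2 bytes wide. */
static int bank_of(unsigned long byte_addr)
{
    return (int)((byte_addr / 2) % 4);
}

/* Count the stalls in one pass of the 2-cycle inner loop.
   x_base and h_base are the byte addresses of x[0] and h[0]; j is the
   outer loop index. */
int inner_loop_stalls(unsigned long x_base, unsigned long h_base,
                      unsigned long j)
{
    int stalls = 0;
    unsigned long i;

    for (i = 0; i < 32; i += 2) {
        /* First loop cycle: LDH h0 = h[i] in parallel with
           LDH x1 = x[j+i+1]. */
        if (bank_of(h_base + 2 * i) == bank_of(x_base + 2 * (j + i + 1)))
            stalls++;
        /* Second loop cycle: LDH h1 = h[i+1] in parallel with
           LDH x0 = x[j+i+2]. */
        if (bank_of(h_base + 2 * (i + 1)) == bank_of(x_base + 2 * (j + i + 2)))
            stalls++;
    }
    return stalls;
}
```

With x[0] at byte address 6 (so x[1] is in bank 0) and h[0] at byte address 2 (bank 1), the function reports no stalls. Moving h[0] to byte address 0 (bank 0) makes all 32 load cycles stall, which doubles the length of the 32-cycle inner loop.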


Example 7-65. Final Assembly Code for Inner Loop of FIR Filter 


LOOP:
        ADD     .L2X    A8,B9,B9        ; sum1 += x1 * h0
||      ADD     .L1     A7,A9,A9        ; sum0 += x0 * h0
||      MPY     .M2     B1,B0,B7        ;* x1 * h1
||      MPY     .M1X    B1,A1,A8        ;* x1 * h0
|| [B2] B       .S2     LOOP            ;** branch to inner loop
||      LDH     .D1     *A5++[2],A1     ;**** h0 = h[i]
||      LDH     .D2     *B5++[2],B1     ;**** x1 = x[j+i+1]

        ADD     .L1X    B7,A9,A9        ; sum0 += x1 * h1
||      ADD     .L2     B8,B9,B9        ; sum1 += x0 * h1
||      MPY     .M2X    A0,B0,B8        ;* x0 * h1
||      MPY     .M1     A0,A1,A7        ;** x0 * h0
|| [B2] SUB     .S2     B2,1,B2         ;*** decrement inner loop cntr
||      LDH     .D2     *B4++[2],B0     ;**** h1 = h[i+1]
||      LDH     .D1     *A4++[2],A0     ;**** x0 = x[j+i+2]


It is not always possible to fully control how arrays are aligned, especially if one
of the arrays is passed into a function as a pointer and that pointer has different
alignments each time the function is called. One solution to this problem is to
write an FIR filter that avoids memory hits, regardless of the x and h array
alignments.


If accesses to the even and odd elements of an array (h or x) are scheduled
on the same cycle, the accesses are always on adjacent memory banks. Thus,
to write an FIR filter that never has memory hits, even and odd elements of the
same array must be scheduled on the same loop cycle.
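The claim that pairing even and odd elements of the same array never causes a bank conflict can be checked exhaustively for a 4-bank memory. The sketch below uses hypothetical function names and assumes halfword-aligned arrays in a single memory block with 2-byte-wide banks. It tries every combination of array alignment and outer loop index and returns the worst stall count found.

```c
/* Bank (0 through 3) of a byte address in a 4-bank memory whose banks are
   2 bytes wide. */
static int bank_of(unsigned long byte_addr)
{
    return (int)((byte_addr / 2) % 4);
}

/* Stalls in one inner loop pass when each loop cycle pairs two adjacent
   elements of the same array. */
int paired_load_stalls(unsigned long x_base, unsigned long h_base,
                       unsigned long j)
{
    int stalls = 0;
    unsigned long i;

    for (i = 0; i < 32; i += 2) {
        /* One loop cycle: h[i] in parallel with h[i+1]. */
        if (bank_of(h_base + 2 * i) == bank_of(h_base + 2 * (i + 1)))
            stalls++;
        /* Another loop cycle: x[j+i+1] in parallel with x[j+i+2]. */
        if (bank_of(x_base + 2 * (j + i + 1)) ==
            bank_of(x_base + 2 * (j + i + 2)))
            stalls++;
    }
    return stalls;
}

/* Try every halfword alignment of both arrays (bank assignment repeats
   every 8 bytes, so 16 bytes covers all cases) and every outer index. */
int worst_case_paired_stalls(void)
{
    int worst = 0;
    unsigned long xb, hb, j;

    for (xb = 0; xb < 16; xb += 2)
        for (hb = 0; hb < 16; hb += 2)
            for (j = 0; j < 100; j += 2) {
                int s = paired_load_stalls(xb, hb, j);
                if (s > worst)
                    worst = s;
            }
    return worst;
}
```

Because adjacent halfwords always sit in adjacent banks, worst_case_paired_stalls() returns 0 regardless of how x and h are placed.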
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In the case of the FIR filter, scheduling the even and odd elements of the same 
array on the same loop cycle cannot be done in a 2-cycle loop, as shown in 
Figure 7—24. In this example, a valid 2-cycle software-pipelined loop without 
memory constraints is ruled by the following constraints: 


- LDH h0 and LDH h1 are on the same loop cycle.
- LDH x0 and LDH x1 are on the same loop cycle.
- MPY p00 must be scheduled three or four cycles after LDH x0, because
  it must read x0 from the previous iteration of LDH x0.
- All MPYs must be five or six cycles after their LDH parents.
- No MPYs on the same side (A or B) can be on the same loop cycle.


Figure 7-24. Dependency Graph of FIR Filter (With Even and Odd Elements of 
Each Array on Same Loop Cycle) 
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A side 


Note: Numbers in bold represent the cycle the instruction is scheduled on. 


The scenario in Figure 7-24 almost works. All nodes satisfy the above 
constraints except MPY p10. Because one parent is on cycle 1 (LDH h0) and 
another on cycle 0 (LDH x1), the only cycle for MPY p10 is cycle 6. However, 
another MPY on the A side is also scheduled on cycle 6 (MPY p00). Other 
combinations of cycles for this graph produce similar results. 
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7.12.2 Unrolled FIR Filter C Code 


The main limitation in solving the problem in Figure 7-24 is in scheduling a
2-cycle loop, which means that no value can be live more than two cycles.
Increasing the iteration interval to 3 decreases performance. A better solution
is to unroll the inner loop one more time and produce a 4-cycle loop.


Example 7-66 shows the FIR filter C code after unrolling the inner loop one 
more time. This solution adds to the flexibility of scheduling and allows you to 
write FIR filter code that never has memory hits, regardless of array alignment 
and memory block. 


Example 7-66. FIR Filter C Code (Unrolled) 


void fir(short x[], short h[], short y[])
{
    int i, j, sum0, sum1;
    short x0, x1, x2, x3, h0, h1, h2, h3;

    for (j = 0; j < 100; j += 2) {
        sum0 = 0;
        sum1 = 0;
        x0 = x[j];

        for (i = 0; i < 32; i += 4) {
            x1 = x[j+i+1];
            h0 = h[i];
            sum0 += x0 * h0;
            sum1 += x1 * h0;
            x2 = x[j+i+2];
            h1 = h[i+1];
            sum0 += x1 * h1;
            sum1 += x2 * h1;
            x3 = x[j+i+3];
            h2 = h[i+2];
            sum0 += x2 * h2;
            sum1 += x3 * h2;
            x0 = x[j+i+4];
            h3 = h[i+3];
            sum0 += x3 * h3;
            sum1 += x0 * h3;
        }

        y[j] = sum0 >> 15;
        y[j+1] = sum1 >> 15;
    }
}
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7.12.3 Translating C Code to Linear Assembly 


Example 7-67 shows the linear assembly for the unrolled inner loop of the FIR
filter C code.


Example 7-67. Linear Assembly for Unrolled FIR Inner Loop 


        LDH     *x++,x1             ; x1 = x[j+i+1]
        LDH     *h++,h0             ; h0 = h[i]
        MPY     x0,h0,p00           ; x0 * h0
        MPY     x1,h0,p10           ; x1 * h0
        ADD     p00,sum0,sum0       ; sum0 += x0 * h0
        ADD     p10,sum1,sum1       ; sum1 += x1 * h0

        LDH     *x++,x2             ; x2 = x[j+i+2]
        LDH     *h++,h1             ; h1 = h[i+1]
        MPY     x1,h1,p01           ; x1 * h1
        MPY     x2,h1,p11           ; x2 * h1
        ADD     p01,sum0,sum0       ; sum0 += x1 * h1
        ADD     p11,sum1,sum1       ; sum1 += x2 * h1

        LDH     *x++,x3             ; x3 = x[j+i+3]
        LDH     *h++,h2             ; h2 = h[i+2]
        MPY     x2,h2,p02           ; x2 * h2
        MPY     x3,h2,p12           ; x3 * h2
        ADD     p02,sum0,sum0       ; sum0 += x2 * h2
        ADD     p12,sum1,sum1       ; sum1 += x3 * h2

        LDH     *x++,x0             ; x0 = x[j+i+4]
        LDH     *h++,h3             ; h3 = h[i+3]
        MPY     x3,h3,p03           ; x3 * h3
        MPY     x0,h3,p13           ; x0 * h3
        ADD     p03,sum0,sum0       ; sum0 += x3 * h3
        ADD     p13,sum1,sum1       ; sum1 += x0 * h3

 [cntr] SUB     cntr,1,cntr         ; decrement loop counter
 [cntr] B       LOOP                ; branch to loop
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7.12.4 Drawing a Dependency Graph 


Figure 7-25 shows the dependency graph of the FIR filter with no memory 
hits. 


Figure 7-25. Dependency Graph of FIR Filter (With No Memory Hits) 


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7.12.5 Linear Assembly for Unrolled FIR Inner Loop With .mptr Directive 


Example 7-68 shows the unrolled FIR inner loop with the .mptr directive. The
.mptr directive allows the assembly optimizer to automatically determine if two
memory operations have a bank conflict by associating memory access
information with a specific pointer register.


If the assembly optimizer determines that two memory operations have a bank
conflict, then it will not schedule them in parallel. The .mptr directive tells the
assembly optimizer that when the specified register is used as a memory
pointer in a load or store instruction, it is initialized to point at a base
location + <offset>, and is incremented a number of times each time through the loop.


Without the .mptr directives, the loads of x1 and h0 are scheduled in parallel, 
and the loads of x2 and h1 are scheduled in parallel. This results in a 50% 
chance of a memory conflict on every cycle. 


Example 7-68. Linear Assembly for Full Unrolled FIR Filter 


        .global _fir
_fir:   .cproc  x, h, y
        .reg    x_1, h_1, sum0, sum1, ctr, octr
        .reg    p00, p01, p02, p03, p10, p11, p12, p13
        .reg    x0, x1, x2, x3, h0, h1, h2, h3, rstx, rsth

        ADD     h,2,h_1         ; set up pointer to h[1]
        MVK     50,octr         ; outer loop ctr = 100/2
        MVK     64,rstx         ; used to rst x pointer each outer loop
        MVK     64,rsth         ; used to rst h pointer each outer loop

OUTLOOP:
        ADD     x,2,x_1         ; set up pointer to x[j+1]
        SUB     h_1,2,h         ; set up pointer to h[0]
        MVK     8,ctr           ; inner loop ctr = 32/4
        ZERO    sum0            ; sum0 = 0
        ZERO    sum1            ; sum1 = 0
 [octr] SUB     octr,1,octr     ; decrement outer loop counter

        .mptr   x, x+0
        .mptr   x_1, x+2
        .mptr   h, h+0
        .mptr   h_1, h+2

        LDH     .D2     *x++[2],x0      ; x0 = x[j]
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Example 7-68. Linear Assembly for Full Unrolled FIR Filter (Continued) 


LOOP:
        LDH     .D1     *x_1++[2],x1    ; x1 = x[j+i+1]
        LDH     .D1     *h++[2],h0      ; h0 = h[i]
        MPY     .M1X    x0,h0,p00       ; x0 * h0
        MPY     .M1     x1,h0,p10       ; x1 * h0
        ADD     .L1     p00,sum0,sum0   ; sum0 += x0 * h0
        ADD     .L2X    p10,sum1,sum1   ; sum1 += x1 * h0

        LDH     .D2     *x++[2],x2      ; x2 = x[j+i+2]
        LDH     .D2     *h_1++[2],h1    ; h1 = h[i+1]
        MPY     .M2X    x1,h1,p01       ; x1 * h1
        MPY     .M2     x2,h1,p11       ; x2 * h1
        ADD     .L1X    p01,sum0,sum0   ; sum0 += x1 * h1
        ADD     .L2     p11,sum1,sum1   ; sum1 += x2 * h1

        LDH     .D1     *x_1++[2],x3    ; x3 = x[j+i+3]
        LDH     .D1     *h++[2],h2      ; h2 = h[i+2]
        MPY     .M1X    x2,h2,p02       ; x2 * h2
        MPY     .M1     x3,h2,p12       ; x3 * h2
        ADD     .L1     p02,sum0,sum0   ; sum0 += x2 * h2
        ADD     .L2X    p12,sum1,sum1   ; sum1 += x3 * h2

        LDH     .D2     *x++[2],x0      ; x0 = x[j+i+4]
        LDH     .D2     *h_1++[2],h3    ; h3 = h[i+3]
        MPY     .M2X    x3,h3,p03       ; x3 * h3
        MPY     .M2     x0,h3,p13       ; x0 * h3
        ADD     .L1X    p03,sum0,sum0   ; sum0 += x3 * h3
        ADD     .L2     p13,sum1,sum1   ; sum1 += x0 * h3

 [ctr]  SUB     .S2     ctr,1,ctr       ; decrement loop counter
 [ctr]  B       .S2     LOOP            ; branch to loop

        SHR     sum0,15,sum0    ; sum0 >> 15
        SHR     sum1,15,sum1    ; sum1 >> 15
        STH     sum0,*y++       ; y[j] = sum0 >> 15
        STH     sum1,*y++       ; y[j+1] = sum1 >> 15

        SUB     x,rstx,x        ; reset x pointer to x[j]
        SUB     h_1,rsth,h_1    ; reset h pointer to h[0]
 [octr] B       OUTLOOP         ; branch to outer loop

        .endproc




7.12.6 Linear Assembly Resource Allocation 
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As the number of instructions in a loop increases, assigning a specific register 
to every value in the loop becomes increasingly difficult. If 33 instructions in 
a loop each write a value, they cannot each write to a unique register because 
the ’C6x has only 32 registers. As a result, values that are not live on the same 
cycles in the loop must share registers. 


For example, in a 4-cycle loop: 


- If a value is written at the end of cycle 0 and read on cycle 2 of the loop,
  it is live for two cycles (cycles 1 and 2 of the loop).

- If another value is written at the end of cycle 2 and read on cycle 0 (the next
  iteration) of the loop, it is also live for two cycles (cycles 3 and 0 of the loop).


Because both of these values are not live on the same cycles, they can occupy 
the same register. Only after scheduling these instructions and their children 
do you know that they can occupy the same register. 
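The lifetime check described above is mechanical once the schedule is known. The following sketch shows one way to express it; the function names are hypothetical, and it assumes that each value lives for at most one full pass through the loop (a lifetime no longer than the iteration interval) and that the iteration interval does not exceed 32.

```c
/* Bit mask of the loop cycles on which a value is live. Bit c is set when
   the value is live during loop cycle c. The value is written at the end
   of write_cycle and read for the last time on read_cycle; ii is the
   iteration interval (loop length in cycles). */
unsigned live_cycles(unsigned write_cycle, unsigned read_cycle, unsigned ii)
{
    unsigned mask = 0;
    unsigned c = write_cycle % ii;

    do {
        c = (c + 1) % ii;       /* first live cycle follows the write */
        mask |= 1u << c;
    } while (c != read_cycle % ii);
    return mask;
}

/* Two values can share a register when they are never live on the same
   loop cycle. */
int can_share_register(unsigned mask_a, unsigned mask_b)
{
    return (mask_a & mask_b) == 0;
}
```

For the two values in the 4-cycle example, live_cycles(0, 2, 4) covers cycles 1 and 2 and live_cycles(2, 0, 4) covers cycles 3 and 0, so can_share_register reports that they can occupy the same register.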


Register allocation is not complicated but can be tedious when done by hand. 
Each value has to be analyzed for its lifetime and then appropriately combined 
with other values not live on the same cycles in the loop. The assembly opti- 
mizer handles this automatically after it software pipelines the loop. See the 
TMS320C6x Optimizing C Compiler User’s Guide for more information. 
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7.12.7 Determining the Minimum Iteration Interval 


Based on Table 7-24, the minimum iteration interval for the FIR filter with no
memory hits should be 4. An iteration interval of 4 means that two
multiply-accumulates still execute per cycle.


Table 7-24. Resource Table for FIR Filter Code 


(a) A side

Unit(s)               Instructions       Total/Unit
.M1                   4 MPYs             4
.S1                                      0
.D1                   4 LDHs             4
.L1, .S1, or .D1      4 ADDs             4
Total non-.M units                       8
1X paths                                 4

(b) B side

Unit(s)               Instructions       Total/Unit
.M2                   4 MPYs             4
.S2                   B                  1
.D2                   4 LDHs             4
.L2, .S2, or .D2      4 ADDs and SUB     5
Total non-.M units                       10
2X paths                                 4


7.12.8 Final Assembly 


Example 7-69 shows the final assembly for the FIR filter with redundant load
elimination and no memory hits. At the end of the inner loop, there is a branch
to OUTLOOP to execute the next outer loop. The outer loop counter is set to
50 because iterations j and j + 1 are executing each time the inner loop is run.
The inner loop counter is set to 8 because iterations i, i + 1, i + 2, and i + 3 are
executing each inner loop iteration.
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7.12.9 Comparing Performance 


The cycle count for this nested loop is 2402 cycles. There is a rather large 
outer-loop overhead for executing the branch to the outer loop (6 cycles) and 
the inner loop prolog (10 cycles). Section 7.13 addresses how to reduce this 
overhead by software pipelining the outer loop. 


Table 7-25. Comparison of FIR Filter Code 


Code Example                                                             Cycles                     Cycle Count
Example 7-64   FIR with redundant load elimination                       50 (16 x 2 + 9 + 6) + 2    2352
Example 7-69   FIR with redundant load elimination and no memory hits    50 (8 x 4 + 10 + 6) + 2    2402
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Example 7-69. Final Assembly Code for FIR Filter With Redundant Load Elimination 


and No Memory Hits 
        MVK     .S1     50,A2           ; set up outer loop counter

        MVK     .S1     62,A3           ; used to rst x pointer outloop
||      MVK     .S2     64,B10          ; used to rst h pointer outloop

OUTLOOP:
        LDH     .D1     *A4++,B5        ; x0 = x[j]
||      ADD     .L2X    A4,4,B1         ; set up pointer to x[j+2]
||      ADD     .L1X    B4,2,A8         ; set up pointer to h[1]
||      MVK     .S2     8,B2            ; set up inner loop counter
|| [A2] SUB     .S1     A2,1,A2         ; decrement outer loop counter

        LDH     .D2     *B1++[2],B0     ; x2 = x[j+i+2]
||      LDH     .D1     *A4++[2],A0     ; x1 = x[j+i+1]
||      ZERO    .L1     A9              ; zero out sum0
||      ZERO    .L2     B9              ; zero out sum1

        LDH     .D1     *A8++[2],B6     ; h1 = h[i+1]
||      LDH     .D2     *B4++[2],A1     ; h0 = h[i]

        LDH     .D1     *A4++[2],A5     ; x3 = x[j+i+3]
||      LDH     .D2     *B1++[2],B5     ; x0 = x[j+i+4]

        LDH     .D2     *B4++[2],A7     ; h2 = h[i+2]
||      LDH     .D1     *A8++[2],B8     ; h3 = h[i+3]
|| [B2] SUB     .S2     B2,1,B2         ; decrement loop counter

        LDH     .D2     *B1++[2],B0     ;* x2 = x[j+i+2]
||      LDH     .D1     *A4++[2],A0     ;* x1 = x[j+i+1]

        LDH     .D1     *A8++[2],B6     ;* h1 = h[i+1]
||      LDH     .D2     *B4++[2],A1     ;* h0 = h[i]

        MPY     .M1X    B5,A1,A0        ; x0 * h0
||      MPY     .M2X    A0,B6,B6        ; x1 * h1
||      LDH     .D1     *A4++[2],A5     ;* x3 = x[j+i+3]
||      LDH     .D2     *B1++[2],B5     ;* x0 = x[j+i+4]

   [B2] B       .S1     LOOP            ; branch to loop
||      MPY     .M2     B0,B6,B7        ; x2 * h1
||      MPY     .M1     A0,A1,A1        ; x1 * h0
||      LDH     .D2     *B4++[2],A7     ;* h2 = h[i+2]
||      LDH     .D1     *A8++[2],B8     ;* h3 = h[i+3]
|| [B2] SUB     .S2     B2,1,B2         ;* decrement loop counter

        ADD     .L1     A0,A9,A9        ; sum0 += x0 * h0
||      MPY     .M2X    A5,B8,B8        ; x3 * h3
||      MPY     .M1X    B0,A7,A5        ; x2 * h2
||      LDH     .D2     *B1++[2],B0     ;** x2 = x[j+i+2]
||      LDH     .D1     *A4++[2],A0     ;** x1 = x[j+i+1]
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Example 7-69. Final Assembly Code for FIR Filter With Redundant Load Elimination 


and No Memory Hits (Continued) 


LOOP:
        ADD     .L2X    A1,B9,B9        ; sum1 += x1 * h0
||      ADD     .L1X    B6,A9,A9        ; sum0 += x1 * h1
||      MPY     .M2     B5,B8,B7        ; x0 * h3
||      MPY     .M1     A5,A7,A7        ; x3 * h2
|| [B2] LDH     .D1     *A8++[2],B6     ;** h1 = h[i+1]
|| [B2] LDH     .D2     *B4++[2],A1     ;** h0 = h[i]

        ADD     .L2     B7,B9,B9        ; sum1 += x2 * h1
||      ADD     .L1     A5,A9,A9        ; sum0 += x2 * h2
||      MPY     .M1X    B5,A1,A0        ;* x0 * h0
||      MPY     .M2X    A0,B6,B6        ;* x1 * h1
|| [B2] LDH     .D1     *A4++[2],A5     ;** x3 = x[j+i+3]
|| [B2] LDH     .D2     *B1++[2],B5     ;** x0 = x[j+i+4]

        ADD     .L2X    A7,B9,B9        ; sum1 += x3 * h2
||      ADD     .L1X    B8,A9,A9        ; sum0 += x3 * h3
|| [B2] B       .S1     LOOP            ;* branch to loop
||      MPY     .M2     B0,B6,B7        ;* x2 * h1
||      MPY     .M1     A0,A1,A1        ;* x1 * h0
|| [B2] LDH     .D2     *B4++[2],A7     ;** h2 = h[i+2]
|| [B2] LDH     .D1     *A8++[2],B8     ;** h3 = h[i+3]
|| [B2] SUB     .S2     B2,1,B2         ;** decrement loop counter

        ADD     .L2     B7,B9,B9        ; sum1 += x0 * h3
||      ADD     .L1     A0,A9,A9        ;* sum0 += x0 * h0
||      MPY     .M2X    A5,B8,B8        ;* x3 * h3
||      MPY     .M1X    B0,A7,A5        ;* x2 * h2
|| [B2] LDH     .D2     *B1++[2],B0     ;*** x2 = x[j+i+2]
|| [B2] LDH     .D1     *A4++[2],A0     ;*** x1 = x[j+i+1]
        ; inner loop branch occurs here

   [A2] B       .S2     OUTLOOP         ; branch to outer loop
||      SUB     .L1     A4,A3,A4        ; reset x pointer to x[j]
||      SUB     .L2     B4,B10,B4       ; reset h pointer to h[0]
||      SUB     .S1     A9,A0,A9        ; sum0 -= x0*h0 (eliminate add)

        SHR     .S1     A9,15,A9        ; sum0 >> 15
||      SHR     .S2     B9,15,B9        ; sum1 >> 15

        STH     .D1     A9,*A6++        ; y[j] = sum0 >> 15

        STH     .D1     B9,*A6++        ; y[j+1] = sum1 >> 15

        NOP     2                       ; branch delay slots
        ; outer loop branch occurs here




7.13 Software Pipelining the Outer Loop 


In previous examples, software pipelining has always affected the inner loop.
However, software pipelining works equally well with the outer loop in a nested
loop.


7.13.1 Unrolled FIR Filter C Code 


Example 7-70 shows the FIR filter C code after unrolling the inner loop
(identical to Example 7-66 on page 7-123).


Example 7-70. Unrolled FIR Filter C Code 


void fir(short x[], short h[], short y[])
{
    int i, j, sum0, sum1;
    short x0, x1, x2, x3, h0, h1, h2, h3;

    for (j = 0; j < 100; j += 2) {
        sum0 = 0;
        sum1 = 0;
        x0 = x[j];

        for (i = 0; i < 32; i += 4) {
            x1 = x[j+i+1];
            h0 = h[i];
            sum0 += x0 * h0;
            sum1 += x1 * h0;
            x2 = x[j+i+2];
            h1 = h[i+1];
            sum0 += x1 * h1;
            sum1 += x2 * h1;
            x3 = x[j+i+3];
            h2 = h[i+2];
            sum0 += x2 * h2;
            sum1 += x3 * h2;
            x0 = x[j+i+4];
            h3 = h[i+3];
            sum0 += x3 * h3;
            sum1 += x0 * h3;
        }

        y[j] = sum0 >> 15;
        y[j+1] = sum1 >> 15;
    }
}
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7.13.2 Making the Outer Loop Parallel With the Inner Loop Epilog and Prolog 


The final assembly code for the FIR filter with redundant load elimination and
no memory hits (shown in Example 7-69 on page 7-130) contained 16 cycles
of overhead to call the inner loop every time: ten cycles for the loop prolog and
six cycles for the outer loop instructions and branching to the outer loop.


Most of this overhead can be reduced as follows: 


- Put the outer loop and branch instructions in parallel with the prolog.
- Create an epilog to the inner loop.
- Put some outer loop instructions in parallel with the inner-loop epilog.


7.13.3 Final Assembly 


Example 7-71 shows the final assembly for the FIR filter with a
software-pipelined outer loop. Below the inner loop (starting on page 7-135),
each instruction is marked in the comments with an e, p, or o for instructions
relating to epilog, prolog, or outer loop, respectively.


The inner loop is now only run seven times, because the eighth iteration is 
done in the epilog in parallel with the prolog of the next inner loop and the outer 
loop instructions. 
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Example 7-71. Final Assembly Code for FIR Filter With Redundant Load Elimination and 
No Memory Hits With Outer Loop Software-Pipelined 


        MVK     .S1     50,A2           ; set up outer loop counter
||      STW     .D2     B11,*B15--      ; push register

        MVK     .S1     74,A3           ; used to rst x ptr outer loop
||      MVK     .S2     72,B10          ; used to rst h ptr outer loop
||      ADD     .L2X    A6,2,B11        ; set up pointer to y[1]

        LDH     .D1     *A4++,B8        ; x0 = x[j]
||      ADD     .L2X    A4,4,B1         ; set up pointer to x[j+2]
||      ADD     .L1X    B4,2,A8         ; set up pointer to h[1]
||      MVK     .S2     8,B2            ; set up inner loop counter
|| [A2] SUB     .S1     A2,1,A2         ; decrement outer loop counter

        LDH     .D2     *B1++[2],B0     ; x2 = x[j+i+2]
||      LDH     .D1     *A4++[2],A0     ; x1 = x[j+i+1]
||      ZERO    .L1     A9              ; zero out sum0
||      ZERO    .L2     B9              ; zero out sum1

        LDH     .D1     *A8++[2],B6     ; h1 = h[i+1]
||      LDH     .D2     *B4++[2],A1     ; h0 = h[i]

        LDH     .D1     *A4++[2],A5     ; x3 = x[j+i+3]
||      LDH     .D2     *B1++[2],B5     ; x0 = x[j+i+4]

OUTLOOP:
        LDH     .D2     *B4++[2],A7     ; h2 = h[i+2]
||      LDH     .D1     *A8++[2],B8     ; h3 = h[i+3]
|| [B2] SUB     .S2     B2,2,B2         ; decrement loop counter

        LDH     .D2     *B1++[2],B0     ;* x2 = x[j+i+2]
||      LDH     .D1     *A4++[2],A0     ;* x1 = x[j+i+1]

        LDH     .D1     *A8++[2],B6     ;* h1 = h[i+1]
||      LDH     .D2     *B4++[2],A1     ;* h0 = h[i]

        MPY     .M1X    B8,A1,A0        ; x0 * h0
||      MPY     .M2X    A0,B6,B6        ; x1 * h1
||      LDH     .D1     *A4++[2],A5     ;* x3 = x[j+i+3]
||      LDH     .D2     *B1++[2],B5     ;* x0 = x[j+i+4]

   [B2] B       .S1     LOOP            ; branch to loop
||      MPY     .M2     B0,B6,B7        ; x2 * h1
||      MPY     .M1     A0,A1,A1        ; x1 * h0
||      LDH     .D2     *B4++[2],A7     ;* h2 = h[i+2]
||      LDH     .D1     *A8++[2],B8     ;* h3 = h[i+3]
|| [B2] SUB     .S2     B2,1,B2         ;* decrement loop counter
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Example 7-71. Final Assembly Code for FIR Filter With Redundant Load Elimination and 
No Memory Hits With Outer Loop Software-Pipelined (Continued) 


        ADD     .L1     A0,A9,A9        ; sum0 += x0 * h0
||      MPY     .M2X    A5,B8,B8        ; x3 * h3
||      MPY     .M1X    B0,A7,A5        ; x2 * h2
||      LDH     .D2     *B1++[2],B0     ;** x2 = x[j+i+2]
||      LDH     .D1     *A4++[2],A0     ;** x1 = x[j+i+1]

LOOP:
        ADD     .L2X    A1,B9,B9        ; sum1 += x1 * h0
||      ADD     .L1X    B6,A9,A9        ; sum0 += x1 * h1
||      MPY     .M2     B5,B8,B7        ; x0 * h3
||      MPY     .M1     A5,A7,A7        ; x3 * h2
||      LDH     .D1     *A8++[2],B6     ;** h1 = h[i+1]
||      LDH     .D2     *B4++[2],A1     ;** h0 = h[i]

        ADD     .L2     B7,B9,B9        ; sum1 += x2 * h1
||      ADD     .L1     A5,A9,A9        ; sum0 += x2 * h2
||      MPY     .M1X    B5,A1,A0        ;* x0 * h0
||      MPY     .M2X    A0,B6,B6        ;* x1 * h1
||      LDH     .D1     *A4++[2],A5     ;** x3 = x[j+i+3]
||      LDH     .D2     *B1++[2],B5     ;** x0 = x[j+i+4]

        ADD     .L2X    A7,B9,B9        ; sum1 += x3 * h2
||      ADD     .L1X    B8,A9,A9        ; sum0 += x3 * h3
|| [B2] B       .S1     LOOP            ;* branch to loop
||      MPY     .M2     B0,B6,B7        ;* x2 * h1
||      MPY     .M1     A0,A1,A1        ;* x1 * h0
||      LDH     .D2     *B4++[2],A7     ;** h2 = h[i+2]
||      LDH     .D1     *A8++[2],B8     ;** h3 = h[i+3]
|| [B2] SUB     .S2     B2,1,B2         ;** decrement loop counter

        ADD     .L2     B7,B9,B9        ; sum1 += x0 * h3
||      ADD     .L1     A0,A9,A9        ;* sum0 += x0 * h0
||      MPY     .M2X    A5,B8,B8        ;* x3 * h3
||      MPY     .M1X    B0,A7,A5        ;* x2 * h2
||      LDH     .D2     *B1++[2],B0     ;*** x2 = x[j+i+2]
||      LDH     .D1     *A4++[2],A0     ;*** x1 = x[j+i+1]
        ; inner loop branch occurs here

        ADD     .L2X    A1,B9,B9        ;e sum1 += x1 * h0
||      ADD     .L1X    B6,A9,A9        ;e sum0 += x1 * h1
||      MPY     .M2     B5,B8,B7        ;e x0 * h3
||      MPY     .M1     A5,A7,A7        ;e x3 * h2
||      SUB     .D1     A4,A3,A4        ;o reset x pointer to x[j]
||      SUB     .D2     B4,B10,B4       ;o reset h pointer to h[0]
|| [A2] B       .S1     OUTLOOP         ;o branch to outer loop
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Example 7-71. Final Assembly Code for FIR Filter With Redundant Load Elimination and 
No Memory Hits With Outer Loop Software-Pipelined (Continued) 


        ADD     .D2     B7,B9,B9        ;e sum1 += x2 * h1
||      ADD     .L1     A5,A9,A9        ;e sum0 += x2 * h2
||      LDH     .D1     *A4++,B8        ;p x0 = x[j]
||      ADD     .L2X    A4,4,B1         ;o set up pointer to x[j+2]
||      ADD     .S1X    B4,2,A8         ;o set up pointer to h[1]
||      MVK     .S2     8,B2            ;o set up inner loop counter

        ADD     .L2X    A7,B9,B9        ;e sum1 += x3 * h2
||      ADD     .L1X    B8,A9,A9        ;e sum0 += x3 * h3
||      LDH     .D2     *B1++[2],B0     ;p x2 = x[j+i+2]
||      LDH     .D1     *A4++[2],A0     ;p x1 = x[j+i+1]
|| [A2] SUB     .S1     A2,1,A2         ;o decrement outer loop counter

        ADD     .L2     B7,B9,B9        ;e sum1 += x0 * h3
||      SHR     .S1     A9,15,A9        ;e sum0 >> 15
||      LDH     .D1     *A8++[2],B6     ;p h1 = h[i+1]
||      LDH     .D2     *B4++[2],A1     ;p h0 = h[i]

        SHR     .S2     B9,15,B9        ;e sum1 >> 15
||      LDH     .D1     *A4++[2],A5     ;p x3 = x[j+i+3]
||      LDH     .D2     *B1++[2],B5     ;p x0 = x[j+i+4]

        STH     .D1     A9,*A6++[2]     ;e y[j] = sum0 >> 15
||      STH     .D2     B9,*B11++[2]    ;e y[j+1] = sum1 >> 15
||      ZERO    .S1     A9              ;o zero out sum0
||      ZERO    .S2     B9              ;o zero out sum1
        ; outer loop branch occurs here


7.13.4 Comparing Performance 


The improved cycle count for this loop is 2006 cycles: 50 ((7 x 4) + 6 + 6) + 6. The
outer-loop overhead for this loop has been reduced from 16 to 8 (6 + 6 - 4);
the -4 represents one iteration less for the inner-loop iteration (seven instead
of eight).


Table 7-26. Comparison of FIR Filter Code 


Code Example                                                             Cycles                     Cycle Count
Example 7-64   FIR with redundant load elimination                       50 (16 x 2 + 9 + 6) + 2    2352
Example 7-69   FIR with redundant load elimination and no memory hits    50 (8 x 4 + 10 + 6) + 2    2402
Example 7-71   FIR with redundant load elimination and no memory hits    50 (7 x 4 + 6 + 6) + 6     2006
               with outer loop software-pipelined
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7.14 Outer Loop Conditionally Executed With Inner Loop 


Software pipelining the outer loop improved the outer loop overhead in the 
previous example from 16 cycles to 8 cycles. Executing the outer loop condi- 
tionally and in parallel with the inner loop eliminates the overhead entirely. 


7.14.1 Unrolled FIR Filter C Code 


Example 7-72 shows the same unrolled FIR filter C code that was used in the
previous example.


Example 7-72. Unrolled FIR Filter C Code 


void fir(short x[], short h[], short y[])
{
    int i, j, sum0, sum1;
    short x0,x1,x2,x3,h0,h1,h2,h3;

    for (j = 0; j < 100; j+=2) {
        sum0 = 0;
        sum1 = 0;
        x0 = x[j];
        for (i = 0; i < 32; i+=4) {
            x1 = x[j+i+1];
            h0 = h[i];
            sum0 += x0 * h0;
            sum1 += x1 * h0;
            x2 = x[j+i+2];
            h1 = h[i+1];
            sum0 += x1 * h1;
            sum1 += x2 * h1;
            x3 = x[j+i+3];
            h2 = h[i+2];
            sum0 += x2 * h2;
            sum1 += x3 * h2;
            x0 = x[j+i+4];
            h3 = h[i+3];
            sum0 += x3 * h3;
            sum1 += x0 * h3;
        }
        y[j] = sum0 >> 15;
        y[j+1] = sum1 >> 15;
    }
}
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7.14.2 Translating C Code to Linear Assembly (Inner Loop) 


Example 7-73 shows a list of linear assembly for the inner loop of the FIR filter
C code (identical to Example 7-67 on page 7-124).


Example 7—73. Linear Assembly for Unrolled FIR Inner Loop 


        LDH     *x++,x1           ; x1 = x[j+i+1]
        LDH     *h++,h0           ; h0 = h[i]
        MPY     x0,h0,p00         ; x0 * h0
        MPY     x1,h0,p10         ; x1 * h0
        ADD     p00,sum0,sum0     ; sum0 += x0 * h0
        ADD     p10,sum1,sum1     ; sum1 += x1 * h0

        LDH     *x++,x2           ; x2 = x[j+i+2]
        LDH     *h++,h1           ; h1 = h[i+1]
        MPY     x1,h1,p01         ; x1 * h1
        MPY     x2,h1,p11         ; x2 * h1
        ADD     p01,sum0,sum0     ; sum0 += x1 * h1
        ADD     p11,sum1,sum1     ; sum1 += x2 * h1

        LDH     *x++,x3           ; x3 = x[j+i+3]
        LDH     *h++,h2           ; h2 = h[i+2]
        MPY     x2,h2,p02         ; x2 * h2
        MPY     x3,h2,p12         ; x3 * h2
        ADD     p02,sum0,sum0     ; sum0 += x2 * h2
        ADD     p12,sum1,sum1     ; sum1 += x3 * h2

        LDH     *x++,x0           ; x0 = x[j+i+4]
        LDH     *h++,h3           ; h3 = h[i+3]
        MPY     x3,h3,p03         ; x3 * h3
        MPY     x0,h3,p13         ; x0 * h3
        ADD     p03,sum0,sum0     ; sum0 += x3 * h3
        ADD     p13,sum1,sum1     ; sum1 += x0 * h3

[cntr]  SUB     cntr,1,cntr       ; decrement loop counter
[cntr]  B       LOOP              ; branch to loop
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7.14.3 Translating C Code to Linear Assembly (Outer Loop) 


Example 7—74 shows the instructions that execute all of the outer loop func- 
tions. All of these instructions are conditional on inner loop counters. Two 
different counters are needed, because they must decrement to 0 on different 
iterations. 


- The resetting of the x and h pointers is conditional on the pointer reset
  counter, pctr.

- The shifting and storing of the even and odd y elements are conditional on
  the store counter, sctr.

When these counters are 0, all of the instructions that are conditional on that
value execute.

- The MVK instruction resets the counters to 8 because after every eight
  iterations of the loop, a new inner loop is completed (8 × 4 elements are
  processed).

- The pointer reset counter becomes 0 first to reset the load pointers, then
  the store counter becomes 0 to shift and store the result.
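
The idea of predicating the outer-loop work on a down-counter can be modeled
in C. The following is a minimal sketch and not the code the tools generate:
the names fir_nested, fir_merged, NTAPS, NOUT, and STEP are assumptions made
for illustration, and the model uses a single counter, whereas the assembly
needs two because the pointer resets and the stores fall in different
iterations of the pipelined loop.

```c
#include <assert.h>

#define NTAPS 32   /* filter length used by the FIR examples          */
#define NOUT  100  /* number of outputs                               */
#define STEP  8    /* taps handled per pass of the merged loop (assumed) */

/* Reference form: ordinary nested loops, two outputs per outer pass. */
void fir_nested(const short *x, const short *h, short *y)
{
    for (int j = 0; j < NOUT; j += 2) {
        int sum0 = 0, sum1 = 0;
        for (int i = 0; i < NTAPS; i++) {
            sum0 += x[j + i] * h[i];
            sum1 += x[j + i + 1] * h[i];
        }
        y[j]     = (short)(sum0 >> 15);
        y[j + 1] = (short)(sum1 >> 15);
    }
}

/* Merged form: one loop; the outer-loop work runs only when the
 * down-counter reaches zero, loosely mirroring the [!sctr]/[!pctr]
 * predicates. */
void fir_merged(const short *x, const short *h, short *y)
{
    int j = 0, i = 0, sum0 = 0, sum1 = 0;
    int ctr = NTAPS / STEP;                  /* passes left in this inner loop */

    assert(NTAPS % STEP == 0);
    for (int octr = (NOUT / 2) * (NTAPS / STEP); octr > 0; octr--) {
        for (int k = 0; k < STEP; k++) {     /* stands in for the unrolled body */
            sum0 += x[j + i + k] * h[i + k];
            sum1 += x[j + i + k + 1] * h[i + k];
        }
        i += STEP;
        if (--ctr == 0) {                    /* once every NTAPS/STEP passes */
            y[j]     = (short)(sum0 >> 15);  /* shift and store             */
            y[j + 1] = (short)(sum1 >> 15);
            sum0 = sum1 = 0;                 /* restart the accumulators    */
            i = 0;                           /* "reset h pointer"           */
            j += 2;                          /* "reset x pointer"           */
            ctr = NTAPS / STEP;              /* reload the counter          */
        }
    }
}
```

Both functions read x[0] through x[NOUT + NTAPS - 2], so the input array must
hold at least that many elements.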


Example 7-74. Linear Assembly for FIR Outer Loop 


[sctr]  SUB     sctr,1,sctr       ; dec store lp cntr
[!sctr] SHR     sum07,15,y0       ; (sum0 >> 15)
[!sctr] SHR     sum17,15,y1       ; (sum1 >> 15)
[!sctr] STH     y0,*y++[2]        ; y[j] = (sum0 >> 15)
[!sctr] STH     y1,*y_1++[2]      ; y[j+1] = (sum1 >> 15)
[!sctr] MVK     4,sctr            ; reset store lp cntr

[pctr]  SUB     pctr,1,pctr       ; dec pointer reset lp cntr
[!pctr] SUB     x,rstx2,x         ; reset x ptr
[!pctr] SUB     x_1,rstx1,x_1     ; reset x_1 ptr
[!pctr] SUB     h,rsth1,h         ; reset h ptr
[!pctr] SUB     h_1,rsth2,h_1     ; reset h_1 ptr
[!pctr] MVK     4,pctr            ; reset pointer reset lp cntr


7.14.4 Unrolled FIR Filter C Code 


The total number of instructions to execute both the inner and outer loops is 
38 (26 for the inner loop and 12 for the outer loop). A 4-cycle loop is no longer 
possible. To avoid slowing down the throughput of the inner loop to reduce the 
outer-loop overhead, you must unroll the FIR filter again. 
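
The reason a 4-cycle loop no longer fits follows from the instruction count
alone: the 'C6x has eight functional units, so at most eight instructions can
issue per cycle. The following is a minimal sketch of that bound; the function
name min_ii_from_count is an assumption made for illustration.

```c
#include <assert.h>

/* Lower bound on the iteration interval from the instruction count alone:
 * ceil(instructions / units). Per-unit conflicts can only raise it. */
int min_ii_from_count(int instructions, int units)
{
    assert(instructions >= 0 && units > 0);
    return (instructions + units - 1) / units;
}
```

With 38 instructions and 8 units the bound is 5 cycles, whereas the 26
inner-loop instructions alone fit in 4.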


Example 7-75 shows the C code for the FIR filter, which operates on eight 
elements every inner loop. Two outer loops are also being processed together, 
as in Example 7—72 on page 7-137. 
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Example 7—75. Unrolled FIR Filter C Code 


void fir(short x[], short h[], short y[])
{
    int i, j, sum0, sum1;
    short x0,x1,x2,x3,x4,x5,x6,x7,h0,h1,h2,h3,h4,h5,h6,h7;

    for (j = 0; j < 100; j+=2) {
        sum0 = 0;
        sum1 = 0;
        x0 = x[j];
        for (i = 0; i < 32; i+=8) {
            x1 = x[j+i+1];
            h0 = h[i];
            sum0 += x0 * h0;
            sum1 += x1 * h0;
            x2 = x[j+i+2];
            h1 = h[i+1];
            sum0 += x1 * h1;
            sum1 += x2 * h1;
            x3 = x[j+i+3];
            h2 = h[i+2];
            sum0 += x2 * h2;
            sum1 += x3 * h2;
            x4 = x[j+i+4];
            h3 = h[i+3];
            sum0 += x3 * h3;
            sum1 += x4 * h3;
            x5 = x[j+i+5];
            h4 = h[i+4];
            sum0 += x4 * h4;
            sum1 += x5 * h4;
            x6 = x[j+i+6];
            h5 = h[i+5];
            sum0 += x5 * h5;
            sum1 += x6 * h5;
            x7 = x[j+i+7];
            h6 = h[i+6];
            sum0 += x6 * h6;
            sum1 += x7 * h6;
            x0 = x[j+i+8];
            h7 = h[i+7];
            sum0 += x7 * h7;
            sum1 += x0 * h7;
        }
        y[j] = sum0 >> 15;
        y[j+1] = sum1 >> 15;
    }
}
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7.14.5 Translating C Code to Linear Assembly (Inner Loop) 


Example 7—76 shows the instructions that perform the inner and outer loops 
of the FIR filter. These instructions reflect the following modifications: 


- LDWs are used instead of LDHs to reduce the number of loads in the loop.

- The reset pointer instructions immediately follow the LDW instructions.

- The first ADD instructions for sum0 and sum1 are conditional on the same
  value as the store counter, because when sctr is 0, the end of one inner
  loop has been reached and the first ADD, which adds the previous sum07
  to p00, must not be executed.

- The first ADD for sum0 writes to the same register as the first MPY p00.
  The second ADD reads p00 and p01. At the beginning of each inner loop,
  the first ADD is not performed, so the second ADD correctly reads the
  results of the first two MPYs (p01 and p00) and adds them together. For
  other iterations of the inner loop, the first ADD executes, and the second
  ADD sums the second MPY result (p01) with the running accumulator. The
  same is true for the first and second ADDs of sum1.
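
Replacing LDH with LDW means each register holds two 16-bit elements, and the
multiply variants select which halves to use. The following C model is a
minimal sketch of that selection. It assumes little-endian loads (the element
at the lower address lands in the low half), and the helper names pack16,
lo16, and hi16 are illustrative assumptions; mpy, mpyh, mpylh, and mpyhl model
the corresponding instructions.

```c
#include <assert.h>
#include <stdint.h>

/* Pack two 16-bit samples as a little-endian word load would see them. */
uint32_t pack16(int16_t lo, int16_t hi)
{
    return (uint32_t)(uint16_t)lo | ((uint32_t)(uint16_t)hi << 16);
}

/* Sign-extended low and high halves of a 32-bit word. */
static int32_t lo16(uint32_t w)
{
    int32_t v = (int32_t)(w & 0xFFFFu);
    return (v >= 0x8000) ? v - 0x10000 : v;
}
static int32_t hi16(uint32_t w) { return lo16(w >> 16); }

int32_t mpy  (uint32_t a, uint32_t b) { return lo16(a) * lo16(b); } /* low  x low  */
int32_t mpyh (uint32_t a, uint32_t b) { return hi16(a) * hi16(b); } /* high x high */
int32_t mpylh(uint32_t a, uint32_t b) { return lo16(a) * hi16(b); } /* low  x high */
int32_t mpyhl(uint32_t a, uint32_t b) { return hi16(a) * lo16(b); } /* high x low  */
```

With h01 holding h[i] and h[i+1], and x01 holding x[j+i] and x[j+i+1],
mpylh(h01, x01) gives h[i] * x[j+i+1], which is the p10 term of sum1.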
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Example 7—76. Linear Assembly for FIR With Outer Loop Conditionally Executed 


With Inner Loop 


        LDW     *h++[2],h01       ; h[i+0] & h[i+1]
        LDW     *h_1++[2],h23     ; h[i+2] & h[i+3]
        LDW     *h++[2],h45       ; h[i+4] & h[i+5]
        LDW     *h_1++[2],h67     ; h[i+6] & h[i+7]
        LDW     *x++[2],x01       ; x[j+i+0] & x[j+i+1]
        LDW     *x_1++[2],x23     ; x[j+i+2] & x[j+i+3]
        LDW     *x++[2],x45       ; x[j+i+4] & x[j+i+5]
        LDW     *x_1++[2],x67     ; x[j+i+6] & x[j+i+7]
        LDH     *x,x8             ; x[j+i+8]

[sctr]  SUB     sctr,1,sctr       ; dec store lp cntr
[!sctr] SHR     sum07,15,y0       ; (sum0 >> 15)
[!sctr] SHR     sum17,15,y1       ; (sum1 >> 15)
[!sctr] STH     y0,*y++[2]        ; y[j] = (sum0 >> 15)
[!sctr] STH     y1,*y_1++[2]      ; y[j+1] = (sum1 >> 15)

        MV      x01,x01b          ; move to other reg file
        MPYLH   h01,x01b,p10      ; p10 = h[i+0]*x[j+i+1]
[sctr]  ADD     p10,sum17,p10     ; sum1(p10) = p10 + sum1
        MPYHL   h01,x23,p11       ; p11 = h[i+1]*x[j+i+2]
        ADD     p11,p10,sum11     ; sum1 += p11
        MPYLH   h23,x23,p12       ; p12 = h[i+2]*x[j+i+3]
        ADD     p12,sum11,sum12   ; sum1 += p12
        MPYHL   h23,x45,p13       ; p13 = h[i+3]*x[j+i+4]
        ADD     p13,sum12,sum13   ; sum1 += p13
        MPYLH   h45,x45,p14       ; p14 = h[i+4]*x[j+i+5]
        ADD     p14,sum13,sum14   ; sum1 += p14
        MPYHL   h45,x67,p15       ; p15 = h[i+5]*x[j+i+6]
        ADD     p15,sum14,sum15   ; sum1 += p15
        MPYLH   h67,x67,p16       ; p16 = h[i+6]*x[j+i+7]
        ADD     p16,sum15,sum16   ; sum1 += p16
        MPYHL   h67,x8,p17        ; p17 = h[i+7]*x[j+i+8]
        ADD     p17,sum16,sum17   ; sum1 += p17

        MPY     h01,x01,p00       ; p00 = h[i+0]*x[j+i+0]
[sctr]  ADD     p00,sum07,p00     ; sum0(p00) = p00 + sum0
        MPYH    h01,x01,p01       ; p01 = h[i+1]*x[j+i+1]
        ADD     p01,p00,sum01     ; sum0 += p01
        MPY     h23,x23,p02       ; p02 = h[i+2]*x[j+i+2]
        ADD     p02,sum01,sum02   ; sum0 += p02
        MPYH    h23,x23,p03       ; p03 = h[i+3]*x[j+i+3]
        ADD     p03,sum02,sum03   ; sum0 += p03
        MPY     h45,x45,p04       ; p04 = h[i+4]*x[j+i+4]
        ADD     p04,sum03,sum04   ; sum0 += p04
        MPYH    h45,x45,p05       ; p05 = h[i+5]*x[j+i+5]
        ADD     p05,sum04,sum05   ; sum0 += p05
        MPY     h67,x67,p06       ; p06 = h[i+6]*x[j+i+6]
        ADD     p06,sum05,sum06   ; sum0 += p06
        MPYH    h67,x67,p07       ; p07 = h[i+7]*x[j+i+7]
        ADD     p07,sum06,sum07   ; sum0 += p07

[!sctr] MVK     4,sctr            ; reset store lp cntr
[pctr]  SUB     pctr,1,pctr       ; dec pointer reset lp cntr
[!pctr] SUB     x,rstx2,x         ; reset x ptr
[!pctr] SUB     x_1,rstx1,x_1     ; reset x_1 ptr
[!pctr] SUB     h,rsth1,h         ; reset h ptr
[!pctr] SUB     h_1,rsth2,h_1     ; reset h_1 ptr
[!pctr] MVK     4,pctr            ; reset pointer reset lp cntr
[octr]  SUB     octr,1,octr       ; dec outer lp cntr
[octr]  B       LOOP              ; Branch outer loop


7.14.6 Translating C Code to Linear Assembly (Inner Loop and Outer Loop) 


Example 7—77 shows the linear assembly with functional units assigned. (As 
in Example 7-68 on page 7-126, symbolic names now have an A or B in front 
of them to signify the register file where they reside.) Although this allocation 
is one of many possibilities, one goal is to keep the 1X and 2X paths to a 
minimum. Even with this goal, you have five 2X paths and seven 1X paths. 


One requirement that was assumed when the functional units were chosen
was that all the sum0 values reside on the same side (A in this case) and all
the sum1 values reside on the other side (B). Because you are scheduling
eight accumulates for both sum0 and sum1 in an 8-cycle loop, each ADD must
be scheduled immediately following the previous ADD. Therefore, it is
undesirable for any sum0 ADDs to use the same functional units as sum1 ADDs.


One MV instruction was added to get x01 on the B side for the MPYLH p10 
instruction. 
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Example 7-77. Linear Assembly for FIR With Outer Loop Conditionally Executed 
With Inner Loop (With Functional Units) 


        .global _fir

_fir:   .cproc  x, h, y

        .reg    x_1, h_1, y_1, octr, pctr, sctr
        .reg    sum01, sum02, sum03, sum04, sum05, sum06, sum07
        .reg    sum11, sum12, sum13, sum14, sum15, sum16, sum17
        .reg    p00, p01, p02, p03, p04, p05, p06, p07
        .reg    p10, p11, p12, p13, p14, p15, p16, p17
        .reg    x01b, x01, x23, x45, x67, x8, h01, h23, h45, h67
        .reg    y0, y1, rstx1, rstx2, rsth1, rsth2

        ADD     x,4,x_1           ; point to x[2]
        ADD     h,4,h_1           ; point to h[2]
        ADD     y,2,y_1           ; point to y[1]
        MVK     60,rstx1          ; used to rst x pointer each outer loop
        MVK     60,rstx2          ; used to rst x pointer each outer loop
        MVK     64,rsth1          ; used to rst h pointer each outer loop
        MVK     64,rsth2          ; used to rst h pointer each outer loop
        MVK     201,octr          ; loop ctr = 201 = (100/2) * (32/8) + 1
        MVK     4,pctr            ; pointer reset lp cntr = 32/8
        MVK     5,sctr            ; reset store lp cntr = 32/8 + 1
        ZERO    sum07             ; sum07 = 0
        ZERO    sum17             ; sum17 = 0

        .mptr   x, x+0
        .mptr   x_1, x+4
        .mptr   h, h+0
        .mptr   h_1, h+4

LOOP:   .trip 8

        LDW     .D1T1   *h++[2],h01       ; h[i+0] & h[i+1]
        LDW     .D2T2   *h_1++[2],h23     ; h[i+2] & h[i+3]
        LDW     .D1T1   *h++[2],h45       ; h[i+4] & h[i+5]
        LDW     .D2T2   *h_1++[2],h67     ; h[i+6] & h[i+7]
        LDW     .D2T1   *x++[2],x01       ; x[j+i+0] & x[j+i+1]
        LDW     .D1T2   *x_1++[2],x23     ; x[j+i+2] & x[j+i+3]
        LDW     .D2T1   *x++[2],x45       ; x[j+i+4] & x[j+i+5]
        LDW     .D1T2   *x_1++[2],x67     ; x[j+i+6] & x[j+i+7]
        LDH     .D2T1   *x,x8             ; x[j+i+8]

[sctr]  SUB     .S1     sctr,1,sctr       ; dec store lp cntr
[!sctr] SHR     .S1     sum07,15,y0       ; (sum0 >> 15)
[!sctr] SHR     .S2     sum17,15,y1       ; (sum1 >> 15)
[!sctr] STH     .D1     y0,*y++[2]        ; y[j] = (sum0 >> 15)
[!sctr] STH     .D2     y1,*y_1++[2]      ; y[j+1] = (sum1 >> 15)

        MV      .L2X    x01,x01b          ; move to other reg file
        MPYLH   .M2X    h01,x01b,p10      ; p10 = h[i+0]*x[j+i+1]
[sctr]  ADD     .L2     p10,sum17,p10     ; sum1(p10) = p10 + sum1
        MPYHL   .M1X    h01,x23,p11       ; p11 = h[i+1]*x[j+i+2]
        ADD     .L2X    p11,p10,sum11     ; sum1 += p11
        MPYLH   .M2     h23,x23,p12       ; p12 = h[i+2]*x[j+i+3]
        ADD     .L2     p12,sum11,sum12   ; sum1 += p12
        MPYHL   .M1X    h23,x45,p13       ; p13 = h[i+3]*x[j+i+4]
        ADD     .L2X    p13,sum12,sum13   ; sum1 += p13
        MPYLH   .M1     h45,x45,p14       ; p14 = h[i+4]*x[j+i+5]
        ADD     .L2X    p14,sum13,sum14   ; sum1 += p14
        MPYHL   .M2X    h45,x67,p15       ; p15 = h[i+5]*x[j+i+6]
        ADD     .S2     p15,sum14,sum15   ; sum1 += p15
        MPYLH   .M2     h67,x67,p16       ; p16 = h[i+6]*x[j+i+7]
        ADD     .L2     p16,sum15,sum16   ; sum1 += p16
        MPYHL   .M1X    h67,x8,p17        ; p17 = h[i+7]*x[j+i+8]
        ADD     .L2X    p17,sum16,sum17   ; sum1 += p17

        MPY     .M1     h01,x01,p00       ; p00 = h[i+0]*x[j+i+0]
[sctr]  ADD     .L1     p00,sum07,p00     ; sum0(p00) = p00 + sum0
        MPYH    .M1     h01,x01,p01       ; p01 = h[i+1]*x[j+i+1]
        ADD     .L1     p01,p00,sum01     ; sum0 += p01
        MPY     .M2     h23,x23,p02       ; p02 = h[i+2]*x[j+i+2]
        ADD     .L1X    p02,sum01,sum02   ; sum0 += p02
        MPYH    .M2     h23,x23,p03       ; p03 = h[i+3]*x[j+i+3]
        ADD     .L1X    p03,sum02,sum03   ; sum0 += p03
        MPY     .M1     h45,x45,p04       ; p04 = h[i+4]*x[j+i+4]
        ADD     .L1     p04,sum03,sum04   ; sum0 += p04
        MPYH    .M1     h45,x45,p05       ; p05 = h[i+5]*x[j+i+5]
        ADD     .L1     p05,sum04,sum05   ; sum0 += p05
        MPY     .M2     h67,x67,p06       ; p06 = h[i+6]*x[j+i+6]
        ADD     .L1X    p06,sum05,sum06   ; sum0 += p06
        MPYH    .M2     h67,x67,p07       ; p07 = h[i+7]*x[j+i+7]
        ADD     .L1X    p07,sum06,sum07   ; sum0 += p07

[!sctr] MVK     .S1     4,sctr            ; reset store lp cntr
[pctr]  SUB     .S1     pctr,1,pctr       ; dec pointer reset lp cntr
[!pctr] SUB     .S2     x,rstx2,x         ; reset x ptr
[!pctr] SUB     .S1     x_1,rstx1,x_1     ; reset x_1 ptr
[!pctr] SUB     .S1     h,rsth1,h         ; reset h ptr
[!pctr] SUB     .S2     h_1,rsth2,h_1     ; reset h_1 ptr
[!pctr] MVK     .S1     4,pctr            ; reset pointer reset lp cntr
[octr]  SUB     .S2     octr,1,octr       ; dec outer lp cntr
[octr]  B       .S2     LOOP              ; Branch outer loop
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7.14.7 Determining the Minimum Iteration Interval 


Based on Table 7-27, the minimum iteration interval is 8. An iteration interval 
of 8 means that two multiply-accumulates per cycle are still executing. 


Table 7-27. Resource Table for FIR Filter Code 


(a) A side                             (b) B side

Unit(s)               Total/Unit       Unit(s)               Total/Unit

.M1                   8                .M2                   8
.S1                   7                .S2                   6
.D1                   5                .D2                   6
.L1                   8                .L2                   8
Total non-.M units    20               Total non-.M units    20
1X paths                               2X paths
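
The minimum iteration interval can be read from the resource table as the
largest per-unit count, because each functional unit issues at most one
instruction per cycle. The following is a minimal sketch of that reading; the
function name min_ii_from_units and the array layout are assumptions made for
illustration, and the cross-path rows are left out of the model.

```c
#include <assert.h>

/* Resource-limited minimum iteration interval: the busiest unit sets
 * the floor. uses[] holds the instruction count for each unit. */
int min_ii_from_units(const int *uses, int n)
{
    int ii = 0;

    assert(uses != 0 && n > 0);
    for (int k = 0; k < n; k++)
        if (uses[k] > ii)
            ii = uses[k];
    return ii;
}
```

For the unit counts in the table (.M1 8, .S1 7, .D1 5, .L1 8, .M2 8, .S2 6,
.D2 6, .L2 8) the result is 8, and 16 multiply-accumulates in 8 cycles is two
per cycle.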


7.14.8 Final Assembly 


Example 7—78 shows the final assembly for the FIR filter with the outer loop 
conditionally executing in parallel with the inner loop. 




Example 7-78. Final Assembly Code for FIR Filter 


[!Al1] 


[!A1] 
[!Al1] 


[!Al1] 


[A2] 


[!A2] 


[A2] 


PYLH 


PYHL 


PYLH 


B4,A0 
B4,4,B2 
A4,Bl 
A4,4,A4 
200,B0 


*A4++[2],B9 
*B1++[2],A10 
4,Al 


*B2++[2],B7 
*A0++[2],A8 
60,A3 
60,B14 


*B1++[2],A11 
*A4++[2],B10 
Al1,1,Al1 
64,A5 

64,B5 
A6,2,B6 


*A0++[2],A9 
*B2++[2],B8 
A4,A3,A4 


B1,B14,Bl 
AO, A5,A0 
*B1,A8 


A10,0,B8 
5,A2 


A8,B8,B4 
B2,B5,B2 
A8,B9,A14 


A8,A10,A7 
B7,B9,B13 
A2,1,A2 
Bll 


B11,15,B11 
B7,B9,B9 
A8,A10,A10 
B4,B11,B4 
*A4++[2],B9 
*B1++[2],A10 
A10 


, 


a 


ae xX [9FLAZ] 
7* x[jt+it0] 


a 


point to h[0] 
point to h[2] 
point to x[j] & x[jt+1] 
point to x[jt+2] & x[j+3] 
set lp ctr ((32/8)*(100/2) ) 


& h[1] 
& h[3] 


x[jtit2] & x[jt+it+3] 
x[j+it0] & x[j+it+1] 
set pointer reset lp cntr 


h[it+2] & h[it+3] 
h[it+0] & h[it+l] 
used to reset x ptr 
used to reset x ptr 


(16*4-4) 
(16*4-4) 


x[j+it4] & x[jt+it5] 
x[jt+it+6] & x[j+it+7] 

dec pointer reset lp cntr 
used to reset h ptr (16*4) 
used to reset h ptr (16*4) 
point to y[jt1] 


h[it4] & h[it+5] 
h[{it+6] & h[it+7] 
reset x ptr 


reset x ptr 
reset h ptr 
x[jt+it+8] 


move to other reg file 
set store lp cntr 


p10 = h[it+0]*x[Jj+itl 
reset h ptr 
pll = h[itl1]*x[j+it2 


poo h[it+0]*x[j+i+0] 
pl2 = h[it+2]*x[j+it+3 
dec store lp cntr 

zero out initial accumulator 


(Bsuml >> 15) 

p02 = h[it2]*x[j+it2] 

p0Ol = h[itl]*x[j+it+1] 
sum1(p10) = p10 + suml 

& x[j+it3] 

& x[j+itl] 

zero out initial accumulator 




Example 7-78. Final Assembly Code for FIR Filter (Continued) 


LOOP: 
[!A2] 
[BO] 


[A2] 


[!Al1] 


[!Al1] 
[!Al1] 


[!A2] 


[!A2] 
[!A2] 


[!Al1] 
[!Al1] 


-S1 
-S2 
.M2 
-L1 
.M1X 
~L2X 
-D2 
2D 


2b 
-M2X 
M1 
~L2 
-D2 
Dsl 
-S1 


-S2 
M1 
.L1X 
.M2 
~L2X 
sD 
-D2 
-S1 


-M2 
M1 
.L1X 
~L2X 
~S2 
oe 
-D2 


-S1 
M2 
-L1 
.M1X 
-S2 
-D2 
-D1 
.L2X 


ob 
-L2 
.M2X 
-S1 
-S2 
.M1X 


A10,15,A12 
BO,1,B0 
B7,B9,B13 
A7,A10,A7 
B7,A11,A10 
A14,B4,B7 
*B2++[2],B7 
*A0++[2],A8 


A10,A7,A13 
A9,B10,B12 
A9,A11,A10 
B13,B7,B7 
*B1l++[2],A11 
*A4++[2],B10 
Al,1,Al1 


LOOP 
A9,A11,A11 
B9,A13,A13 
B8,B10,B13 
A10,B7,B7 
*A0++[2],A9 
*B2++[2],B8 
A4,A3,A4 


B8,B10,B11 
A9,A11,A11 
B13,A13,A9 
A10,B7,B7 
B1,B14,Bl1 
AO,A5,A0 
*B1,A8 


4,A2 
B8,B10,B13 
A11,A9,A9 
B8,A8,A9 
B12,B7,B10 
B11, *B6++[2] 
A12,*A6++[2] 
A10,0,B8 


A11,A9,A12 
B13,B10,B8 
A8,B8,B4 
4,Al 
B2,B5,B2 
A8,B9,A14 


’ 


’ 
’ 


’ 


’ 
’ 


’ 


’ 
ra 


’ 


’ 


, 
’ 


ra 


;* reset pointer reset lp cntr 


’ 


’ 


;* reset x ptr 


(Asum0 >> 15) 

dec outer lp cntr 

p03 = h[it3]*x[j+it+3] 
sum0 (p00) = p0OO + sum0 
pl3 = h[it3]*x[j+it4] 
suml += pll 
* h[it+2] & h[it3] 
* h[it+0O] & h[itl] 


sum0 += pOl 
pl5 = h[it5]*x[j+it+6] 
pl4 = h[it4]*x[j+it+5] 
suml += pl2 


;* x[jtit4] & x[j+it+5] 
;* x[jtit6] & x[j+it+7] 
;* dec pointer reset lp cntr 


Branch outer loop 

p04 = h[it4]*x[j+it4] 
sum0 += p02 
pl6 = h[it6]*x[j+it+7] 
suml += p13 


,* hflit4] & h[it5] 
;* hflite] & h[it7] 


p06 = h[it6]*x[j+i+6] 
p0OS = h[it5]*x[j+it5] 
sum0 += p03 
suml += p14 


;* reset x ptr 
;* reset h ptr 
;* x[jt+it+s8] 


reset store lp cntr 

pO7 = h[it7]*x[j+it+7] 
sum0 += p04 

pl7 = h[it7]*x[j+it8] 
suml += p15 

y[jt1] = (Bsuml >> 15) 
ylj]l = (Asum0 >> 15) 

* move to other reg file 


sum0 += p05 
suml += pl6 
* p10 = h[it+0]*x[jt+it1] 


* reset h ptr 
* pll = h[itl]*x[jt+it+2] 
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Example 7-78. Final Assembly Code for FIR Filter (Continued) 


        ADD     .L2X    A9,B8,B11         ; sum1 += p17
        ADD     .L1X    B11,A12,A12       ; sum0 += p06
        MPY     .M1     A8,A10,A7         ;* p00 = h[i+0]*x[j+i+0]
        MPYLH   .M2     B7,B9,B13         ;* p12 = h[i+2]*x[j+i+3]
 [A2]   SUB     .S1     A2,1,A2           ;* dec store lp cntr
        ADD     .L1X    B13,A12,A10       ; sum0 += p07
 [!A2]  SHR     .S2     B11,15,B11        ;* (Bsum1 >> 15)
        MPY     .M2     B7,B9,B9          ;* p02 = h[i+2]*x[j+i+2]
        MPYH    .M1     A8,A10,A10        ;* p01 = h[i+1]*x[j+i+1]
 [A2]   ADD     .L2     B4,B11,B4         ;* sum1(p10) = p10 + sum1
        LDW     .D1     *A4++[2],B9       ;** x[j+i+2] & x[j+i+3]
        LDW     .D2     *B1++[2],A10      ;** x[j+i+0] & x[j+i+1]
        ; Branch occurs here
 [!A2]  SHR     .S1     A10,15,A12        ; (Asum0 >> 15)
 [!A2]  STH     .D2     B11,*B6++[2]      ; y[j+1] = (Bsum1 >> 15)
|| [!A2] STH    .D1     A12,*A6++[2]      ; y[j] = (Asum0 >> 15)


7.14.9 Comparing Performance 


The cycle count of this code is 1612: 50 × ((8 × 4) + 0) + 12. The overhead due
to the outer loop has been completely eliminated.


Table 7-28. Comparison of FIR Filter Code 


Code Example                                               Cycles                    Cycle Count

Example 7-61  FIR with redundant load elimination          50 (16 × 2 + 9 + 6) + 2   2352

Example 7-69  FIR with redundant load elimination and      50 (8 × 4 + 10 + 6) + 2   2402
              no memory hits

Example 7-71  FIR with redundant load elimination and      50 (7 × 4 + 6 + 6) + 6    2006
              no memory hits with outer loop
              software-pipelined

Example 7-74  FIR with redundant load elimination and      50 (8 × 4 + 0) + 12       1612
              no memory hits with outer loop
              conditionally executed with inner loop
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Interrupts 


This chapter describes interrupts from a software-programming point of view. 
A description of single and multiple register assignment is included, followed 
by code generation of interruptible code and finally, descriptions of interrupt 
subroutines. 
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8.1 Overview of Interrupts 




An interrupt is an event that stops the current process in the CPU so that the 
CPU can attend to the task needing completion because of another event. 
These events are external to the core CPU but may originate on-chip or off- 
chip. Examples of on-chip interrupt sources include timers, serial ports, DMAs 
and external memory stalls. Examples of off-chip interrupt sources include 
analog-to-digital converters, host controllers and other peripheral devices. 


Typically, DSPs compute different algorithms very quickly within an asynchro- 
nous system environment. Asynchronous systems must be able to control the 
DSP based on events outside of the DSP core. Because certain events can 
have higher priority than algorithms already executing on the DSP, it is some- 
times necessary to change, or interrupt, the task currently executing on the 
DSP. 


The 'C6x provides hardware interrupts that allow this to occur automatically. 
Once an interrupt is taken, an interrupt subroutine performs certain tasks or 
actions, as required by the event. Servicing an interrupt involves switching 
contexts while saving all state of the machine. Thus, upon return from the inter- 
rupt, operation of the interrupted algorithm is resumed as if there had been no 
interrupt. Saving state involves saving various registers upon entry to the inter- 
rupt subroutine and then restoring them to their original state upon exit. 


This chapter focuses on the software issues associated with interrupts. The 
hardware description of interrupt operation is fully described in the 
TMS320C6x CPU and Instruction Set Reference Guide. 


In order to understand the software issues of interrupts, we must talk about two 
types of code: the code that is interrupted and the interrupt subroutine, which 
performs the tasks required by the interrupt. The following sections provide in- 
formation on: 


- Single and multiple assignment of registers
- Loop interruptibility
- How to use the ’C6x code generation tools to satisfy different requirements
- Interrupt subroutines
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8.2 Single Assignment vs. Multiple Assignment 


Register allocation on the ’C6x can be classified as either single assignment 
or multiple assignment. Single assignment code is interruptible; multiple as- 
signment is not interruptible. This section discusses the differences between 
each and explains why only single assignment is interruptible. 


Example 8-1 shows multiple assignment code. The term multiple assignment 
means that a particular register has been assigned with more than one value 
(in this case 2 values). On cycle 4, at the beginning of the ADD instruction, reg- 
ister A1 is assigned to two different values. One value, written by the SUB in- 
struction on cycle 1, already resides in the register. The second value is called 
an in-flight value and is assigned by the LDW instruction on cycle 2. Because 
the LDW instruction does not actually write a value into register A1 until the end 
of cycle 6, the assignment is considered in-flight. 


In-flight operations cause code to be uninterruptible due to unpredictability.
Take, for example, the case where an interrupt is taken on cycle 3. At this
point, all instructions that have begun execution are allowed to complete and
no new instructions execute. So, 3 cycles after the interrupt is taken on
cycle 3, the LDW instruction writes to A1. After the interrupt service routine
has been processed, program execution continues on cycle 4 with the ADD
instruction. In this case, the ADD reads register A1 and gets the result of the
LDW, whereas normally it would read the result of the SUB. This
unpredictability means that, to ensure correct operation, multiple assignment
code should not be interrupted and is thus considered uninterruptible.


Example 8-1. Code With Multiple Assignment of A1


cycle
1       SUB     .S1     A4,A5,A1    ; writes to A1 in single cycle
2       LDW     .D1     *A0,A1      ; writes to A1 after 4 delay slots
3       NOP
4       ADD     .L1     A1,A2,A3    ; uses old A1 (result of SUB)
5-6     NOP     2
7       MPY     .M1     A1,A4,A5    ; uses new A1 (result of LDW)



Example 8—2 shows the same code with a new register allocation to produce 
single assignment code. Now the LDW assigns a value to register A6 instead 
of A1. Now, regardless of whether an interrupt is taken or not, A1 maintains 
the value written by the SUB instruction because LDW now writes to A6. Be- 
cause there are no in-flight registers that are read before an in-flight instruction 
completes, this code is interruptible. 
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Example 8-2. Code Using Single Assignment 


cycle
1       SUB     .S1     A4,A5,A1    ; writes to A1 in single cycle
2       LDW     .D1     *A0,A6      ; writes to A6 after 4 delay slots
3       NOP
4       ADD     .L1     A1,A2,A3    ; uses A1 (result of SUB)
5-6     NOP     2
7       MPY     .M1     A6,A4,A5    ; uses A6 (result of LDW)


Both examples involve exactly the same schedule of instructions. The only dif- 
ference is the register allocation. The single assignment register allocation, as 
shown in Example 8-2, can result in higher register pressure (Example 8-2 
uses one more register than Example 8-1). 
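
The difference between the two register allocations can be captured in a
small C model of what the ADD on cycle 4 reads. This is a minimal sketch under
the assumptions stated in the text (an interrupt taken on cycle 3 lets the
in-flight LDW complete before the ADD resumes). The function name
a1_read_by_add and the placeholder values are illustrative assumptions.

```c
#include <assert.h>

enum { SUB_VALUE = 10, LDW_VALUE = 20 };  /* arbitrary placeholder results */

/* Returns the value of A1 that the ADD on cycle 4 reads.
 * ldw_writes_a1:       nonzero for the multiple assignment allocation,
 *                      zero when the LDW targets another register (A6).
 * interrupt_on_cycle3: nonzero if an interrupt is taken on cycle 3. */
int a1_read_by_add(int ldw_writes_a1, int interrupt_on_cycle3)
{
    int a1 = SUB_VALUE;   /* cycle 1: the SUB result lands immediately   */
    int ldw_in_flight = 1; /* cycle 2: LDW issued, result still pending  */

    if (interrupt_on_cycle3 && ldw_in_flight) {
        /* In-flight instructions drain before the interrupt is
         * serviced, so the load completes before cycle 4 resumes. */
        if (ldw_writes_a1)
            a1 = LDW_VALUE;
        ldw_in_flight = 0;
    }
    return a1;            /* cycle 4: the ADD reads A1 */
}
```

Only the combination of multiple assignment and an interrupt changes what the
ADD sees; with single assignment the result is the same in both cases.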


The next section describes how to generate interruptible and non-interruptible 
code with the ’C6x code generation tools. 
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8.3 Interruptible Loops 


Even if code employs single assignment, it may not be interruptible in a loop.
Because the delay slots of all branch operations are protected from interrupts
in hardware, all interrupts remain pending as long as the CPU has a pending
branch. Since the branch instruction on the ’C6x has 5 delay slots, loops
smaller than 6 cycles always have a pending branch. For this reason, all loops
smaller than 6 cycles are uninterruptible.


There are two options for making a loop with an iteration interval less than 6 
interruptible. 


1) Simply slow down the loop and force an iteration interval of 6 cycles. This 
is not always desirable since there will be a performance degradation. 


2) Unroll the loop until an iteration interval of 6 or greater is achieved. This 
ensures at least the same performance level and in some cases can im- 
prove performance (see section 7.9, Loop Unrolling and section 8.4.4, 
Getting the Most Performance Out of Interruptible Code). The disadvan- 
tage is that code size increases. 
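
For the second option, the unroll factor needed can be estimated from the
6-cycle limit. The following is a minimal sketch that assumes, as a
simplification, that unrolling by a factor k multiplies the iteration interval
by k; real schedules may do better. The function name
unroll_for_interruptibility is an assumption made for illustration.

```c
#include <assert.h>

#define MIN_INTERRUPTIBLE_II 6  /* a branch has 5 delay slots, per the text */

/* Smallest unroll factor k with k * ii >= 6, under the simplifying
 * assumption that the unrolled loop has an iteration interval of k * ii. */
int unroll_for_interruptibility(int ii)
{
    assert(ii > 0);
    return (MIN_INTERRUPTIBLE_II + ii - 1) / ii;
}
```

A 2-cycle loop would need to be unrolled three times under this model, and a
loop that already takes 6 or more cycles needs no unrolling.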


The next section describes how to automatically generate these different op- 
tions with the ‘C6x code generation tools. 
8.4 Interruptible Code Generation


The 'C6x code generation tools provide a large degree of flexibility for interrup- 
tibility. Various combinations of single and multiple assignment code can be 
generated automatically to provide the best tradeoff in interruptibility and per- 
formance for each part of an application. In most cases, code performance is 
not affected by interruptibility, but there are some exceptions: 


- Software pipelined loops that have high register pressure can fail to
  allocate registers at a given iteration interval when single assignment is
  required, but might otherwise succeed to allocate if multiple assignment
  were allowed. This can result in a larger iteration interval for single
  assignment software pipelined loops and thus lower performance. To determine
  if this is a problem for looped code, use the -mw feedback option. If you
  see a “Cannot allocate machine registers” message after the message
  about searching for a software pipeline schedule, then you have a register
  pressure problem.

- Because loops with minimum iteration intervals less than 6 are not
  interruptible, higher iteration intervals might be used, which results in
  lower performance. Unrolling the loop, however, prevents this reduction in
  performance. (See section 8.4.4.)

- Higher register pressure in single assignment can cause data spilling to
  memory in both looped code and non-looped code when there are not
  enough registers to store all temporary values. This reduces performance
  but occurs rarely and only in extreme cases.


The tools provide 3 levels of control to the user. These levels are described in 
the following sections. For a full description of interruptible code generation, 
see the TMS320C6x Optimizing C Compiler User’s Guide. 


8.4.1 Level 0 - Specified Code is Guaranteed to Not Be Interrupted


The compiler does not disable interrupts. Thus, it is up to you to guarantee
that no interrupts occur. This level has the advantage that the compiler is
allowed to use multiple assignment code and generate the minimum iteration
intervals for software pipelined loops.


The command line option -mi (no value specified) can be used for an entire 
module and the following pragma can be used to force this level on a particular 
function: 


#pragma FUNC_INTERRUPT_THRESHOLD (func, uint_max) ; 




8.4.2 Level 1 — Specified Code Interruptible at All Times 


The compiler will not disable interrupts. Thus, the compiler will employ single
assignment everywhere and will never produce a loop of less than 6 cycles.
The command line option -mi1 can be used for an entire module and the
following pragma can be used to force this level on a particular function:


#pragma FUNC_INTERRUPT_THRESHOLD (func, 1); 


8.4.3 Level 2 — Specified Code Interruptible Within Threshold Cycles 


The compiler will disable interrupts around loops if the specified threshold
number is not exceeded. In other words, the user can specify a threshold, or
maximum interrupt delay, that allows the compiler to use multiple assignment
in loops that do not exceed this threshold. The code outside of loops can have
interrupts disabled and also use multiple assignment as long as the threshold
of uninterruptible cycles is not exceeded. If the compiler cannot determine the
loop count of a loop, then it assumes the threshold is exceeded and will
generate an interruptible loop.


The command line option -mi (threshold) can be used for an entire module and
the following pragma can be used to specify a threshold for a particular
function.


#pragma FUNC_INTERRUPT_THRESHOLD (func, threshold); 
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8.4.4 Getting the Most Performance Out of Interruptible Code 


As stated in Chapter 4 and Chapter 7, the .trip directive and the _nassert
intrinsic can be used to specify a maximum value for the trip count of a loop.
This information can help to prevent performance loss when your loops need to
be interruptible as in Example 8-3.


For example, if your application has an interrupt threshold of 100 cycles, you 
will use the -mi100 option when compiling your application. Assume that there 
is a dot product routine in your application as follows: 


Example 8-3. Dot Product With _nassert Guaranteeing Minimum Trip Count 


int dot_prod(short *a, short *b, int n)
{
    int i, sum = 0;
    _nassert(n >= 20);
    for (i = 0; i < n; i++)
        sum += a[i] * b[i];

    return sum;
}


With the _nassert intrinsic, the compiler only knows that this loop will execute 
at least 20 times. Even with the interrupt threshold set at 100 by the -mi option, 
the compiler will still produce a 6-cycle loop for this code (with only one result 
computed during those six cycles) because the compiler has to expect that a 
value of greater than 100 may be passed into n. 


After looking at the application, you discover that n will never be passed a value
greater than 50 in the dot product routine. Example 8-4 adds this information
to the _nassert statement as follows:
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Example 8—4. Dot Product With _nassert Guaranteeing Trip Count Range 


int dot_prod(short *a, short *b, int n)
{
    int i, sum = 0;
    _nassert ((n >= 20) && (n <= 50));
    for (i = 0; i < n; i++)
        sum += a[i] * b[i];

    return sum;
}


Now the compiler knows that the loop will complete in less than 100 cycles 
when it generates a 1-cycle kernel that must execute 50 times (which equals 
50 cycles). The total cycle count of the loop is now known to be less than the 
interrupt threshold, so the compiler will generate the optimal 1-cycle kernel 
loop. You can do the same thing in linear assembly code by specifying both 
the minimum and maximum trip counts with the .trip directive. 
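
The decision described above can be modeled roughly in C. This is only a sketch of the rule as stated in the text, not the compiler's actual algorithm: the function name and its parameters are hypothetical, and prolog, epilog, and memory stalls are ignored.

```c
#include <stdbool.h>

/* Hypothetical model of the interrupt-threshold decision.
 *   max_trip_known : true if an upper bound on the trip count is known
 *                    (for example, from _nassert or .trip)
 *   max_trip       : that upper bound
 *   kernel_cycles  : cycles per kernel iteration
 *   threshold      : value given with -mi or FUNC_INTERRUPT_THRESHOLD
 * Returns true if the loop may run with interrupts disabled. */
bool may_disable_interrupts(bool max_trip_known, long max_trip,
                            long kernel_cycles, long threshold)
{
    if (!max_trip_known)
        return false;   /* unknown count: assume the threshold is exceeded */
    return max_trip * kernel_cycles <= threshold;
}
```

With a threshold of 100, a 1-cycle kernel, and a known maximum of 50 trips, the model allows the uninterruptible loop; with no known maximum it does not.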


Note: The compiler does not take memory bank conflicts into account. Be- 
cause of this it is recommended that you are conservative with the threshold 
value. 


Let us now assume the worst case scenario: the application needs to be
interruptible at any given cycle. In this case, you will build your application
with an interrupt threshold of one. It is still possible to regain some of the
performance lost by setting the interrupt threshold to one. This is where the
factor option in .trip and the modulus operator in a _nassert intrinsic are
useful, as Example 8-5 shows. (Refer to section 4.3.3.4, Loop Unrolling.)


Example 8-5. Dot Product With _nassert Guaranteeing Trip Count Range and Factor of 2 


int dot_prod(short *a, short *b, int n)
{
    int i, sum = 0;
    _nassert ((n >= 20) && (n <= 50) && ((n%2)==0));
    for (i = 0; i < n; i++)
        sum += a[i] * b[i];

    return sum;
}


By enabling unrolling, performance has doubled from one result per 6-cycle
kernel to two results per 6-cycle kernel. By allowing the compiler to maximize
unrolling when using the interrupt threshold of one, you can get most of the
performance back. Example 8-6 shows a dot product loop whose trip count is a
multiple of 4 between 16 and 48.


Example 8-6. Dot Product With _nassert Guaranteeing Trip Count Range and Factor of 4 


int dot_prod(short *a, short *b, int n)
{
    int i, sum = 0;
    _nassert ((n >= 16) && (n <= 48) && ((n%4)==0));
    for (i = 0; i < n; i++)
        sum += a[i] * b[i];

    return sum;
}

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The compiler knows that the trip count is some factor of four. The compiler will 
unroll this loop such that four iterations of the loop (four results are calculated) 
occur during the six cycle loop kernel. This is an improvement of four times 
over the first attempt at building the code with an interrupt threshold of one. The 
one drawback of unrolling the code is that code size increases, so using this 
type of optimization should only be done on key loops. 
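
To make the effect of unrolling concrete, here is a hand-unrolled C sketch of the same dot product. It only illustrates what "four results per kernel" means at the source level; the code the compiler actually generates is not shown in this chapter, and the function names dot_prod_unrolled4 and unroll_demo are hypothetical.

```c
/* Straightforward dot product. */
int dot_prod(const short *a, const short *b, int n)
{
    int i, sum = 0;
    for (i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}

/* Hand-unrolled sketch: valid only when n is a multiple of 4. */
int dot_prod_unrolled4(const short *a, const short *b, int n)
{
    int i, sum0 = 0, sum1 = 0, sum2 = 0, sum3 = 0;
    for (i = 0; i < n; i += 4) {
        sum0 += a[i]     * b[i];
        sum1 += a[i + 1] * b[i + 1];
        sum2 += a[i + 2] * b[i + 2];
        sum3 += a[i + 3] * b[i + 3];
    }
    return sum0 + sum1 + sum2 + sum3;
}

/* Returns 1 if both forms agree on a 16-element input. */
int unroll_demo(void)
{
    short a[16], b[16];
    int i;
    for (i = 0; i < 16; i++) {
        a[i] = (short)(i + 1);
        b[i] = (short)(2 * i);
    }
    return dot_prod(a, b, 16) == dot_prod_unrolled4(a, b, 16);
}
```

The unrolled body is four times larger, which is the code size cost mentioned above.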
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8.5 Interrupt Subroutines 


The interrupt subroutine (ISR) is simply the routine, or function, that is called 
by an interrupt. The ’C6x provides hardware to automatically branch to this 
routine when an interrupt is received based on an interrupt service table. (See 
the Interrupt Service Table in the TMS320C6x CPU and Instruction Set Refer- 
ence Guide.) Once the branch is complete, execution begins at the first exe- 
cute packet of the ISR. 


Certain state must be saved upon entry to an ISR in order to ensure program 
accuracy upon return from the interrupt. For this reason, all registers that are 
used by the ISR must be saved to memory, preferably a stack pointed to by 
a general purpose register acting as a stack pointer. Then, upon return, all val- 
ues must be restored. This is all handled automatically by the C compiler, but 
must be done manually when writing hand-coded assembly. 


8.5.1 ISR with the C Compiler


The C compiler automatically generates ISRs with the keyword interrupt. The 
interrupt function must be declared with no arguments and should return void. 
For example: 


interrupt void int_handler()
{
    unsigned int flags;

}


Alternatively, you can use the interrupt pragma to define a function to be an 
ISR: 


#pragma INTERRUPT (func);


The result in either case is that the C compiler automatically creates a function
that obeys all the requirements for an ISR. These are different from the calling
convention of a normal C function in the following ways:


- All general purpose registers used by the subroutine must be saved to the
  stack. If another function is called from the ISR, then all the registers
  (A0-A15, B0-B15) are saved to the stack.

- A B IRP instruction is used to return from the interrupt subroutine instead
  of the B B3 instruction used for standard C functions.

- A function cannot return a value and thus must be declared void.


See the section on Register Conventions in the TMS320C6x Optimizing C 
Compiler User’s Guide for more information on standard function calling con- 
ventions. 
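
As a small illustration of these rules, the following sketch declares an ISR that takes no arguments and returns void. The handler name and the counter are hypothetical, and the fallback macro is an assumption added only so that the fragment also builds with compilers that lack the interrupt keyword.

```c
/* The interrupt keyword is a TI extension. The empty fallback below is an
 * assumption made only so the sketch also builds with other compilers. */
#ifndef _TMS320C6X
#define interrupt
#endif

volatile unsigned int tick_count = 0;   /* hypothetical shared counter */

/* No arguments and a void return type, as required for an ISR. */
interrupt void timer_isr(void)
{
    tick_count++;
}
```

The counter is declared volatile because it is modified asynchronously with respect to the code that reads it.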
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8.5.2 ISR with Hand-Coded Assembly 


When writing an ISR by hand, it is necessary to handle the same tasks the C
compiler does. So, the following steps must be taken:


- All registers used must be saved to the stack before modification. For this
  reason, it is preferable to maintain one general purpose register to be used
  as a stack pointer in your application. (The C compiler uses B15.)

- If another C routine is called from the ISR (with an assembly branch
  instruction to the _c_func_name label) then all registers must be saved to
  the stack on entry.

- A B IRP instruction must be used to return from the routine. If this is the
  NMI ISR, a B NRP must be used instead.

- An NOP 4 is required after the last LDW in this case to ensure that B0 is
  restored before returning from the interrupt.


Example 8-7. Hand-Coded Assembly ISR 


* Assume Register B0-B4 & A0 are the only registers used by the
* ISR and no other functions are called

        STW     B0,*B15--       ; store B0 to stack
        STW     A0,*B15--       ; store A0 to stack
        STW     B1,*B15--       ; store B1 to stack
        STW     B2,*B15--       ; store B2 to stack
        STW     B3,*B15--       ; store B3 to stack
        STW     B4,*B15--       ; store B4 to stack

* Beginning of ISR code

* End of ISR code

        LDW     *++B15,B4       ; restore B4
        LDW     *++B15,B3       ; restore B3
        LDW     *++B15,B2       ; restore B2
        LDW     *++B15,B1       ; restore B1
        LDW     *++B15,A0       ; restore A0
||      B       IRP             ; return from interrupt
        LDW     *++B15,B0       ; restore B0
        NOP     4               ; allow all multi-cycle instructions
                                ; to complete before branch is taken


8.5.3 Nested Interrupts 
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Sometimes it is desirable to allow higher priority interrupts to interrupt lower 
priority ISRs. To allow nested interrupts to occur, you must first save the IRP, 
IER, and CSR to a register which is not being used or to some other memory
location (usually the stack). Once these have been saved, you can reenable 
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the appropriate interrupts. This involves resetting the GIE bit and then doing 
any necessary modifications to the IER, providing only certain interrupts are 
allowed to interrupt the particular ISR. On return from the ISR, the original val- 
ues of the IRP, IER, and CSR must be restored. 
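
The save, enable, and restore sequence can be sketched in C as a simulation. Everything here is an assumption for illustration: the control registers are replaced by plain variables so the fragment runs on any host, and the names sim_irp, sim_ier, sim_csr, nested_enter, and nested_leave are hypothetical. Only the position of GIE (bit 0 of CSR) is taken from the text.

```c
#define CSR_GIE 0x1u    /* GIE is bit 0 of CSR */

/* Stand-ins for the control registers (plain variables, not hardware). */
static unsigned int sim_irp, sim_ier, sim_csr;

struct isr_context { unsigned int irp, ier, csr; };

/* Save IRP, IER, and CSR, install a new interrupt mask, then set GIE. */
void nested_enter(struct isr_context *ctx, unsigned int new_ier)
{
    ctx->irp = sim_irp;
    ctx->ier = sim_ier;
    ctx->csr = sim_csr;
    sim_ier = new_ier;
    sim_csr |= CSR_GIE;
}

/* Restore the original CSR, IER, and IRP before returning from the ISR. */
void nested_leave(const struct isr_context *ctx)
{
    sim_csr = ctx->csr;
    sim_ier = ctx->ier;
    sim_irp = ctx->irp;
}

/* Returns 1 if the original state survives a simulated nested interrupt. */
int nested_demo(void)
{
    struct isr_context ctx;
    sim_irp = 0x1000u;
    sim_ier = 0x00F3u;
    sim_csr = 0x0100u;              /* GIE clear on entry to the ISR */
    nested_enter(&ctx, 0x0003u);
    if (!(sim_csr & CSR_GIE) || sim_ier != 0x0003u)
        return 0;
    sim_irp = 0x2000u;              /* a nested interrupt overwrites IRP */
    nested_leave(&ctx);
    return sim_irp == 0x1000u && sim_ier == 0x00F3u && sim_csr == 0x0100u;
}
```

The demo shows why IRP must be saved: a nested interrupt overwrites it, and the outer ISR could not return correctly without the saved copy.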


Example 8-8. Hand-Coded Assembly ISR Allowing Nesting of Interrupts 


* Assume Register B0-B4 & A0 are the only registers used by the
* ISR and no other functions are called

        STW     B0,*B15--       ; store B0 to stack
        MVC     IRP,B0          ; save IRP
        STW     A0,*B15--       ; store A0 to stack
        MVC     IER,B1          ; save IER
        MVK     mask,A0         ; setup a new IER (if desirable)
        STW     B1,*B15--       ; store B1 to stack
        MVC     A0,IER          ; setup a new IER (if desirable)
        STW     B2,*B15--       ; store B2 to stack
        MVC     CSR,A0          ; read current CSR
        STW     B3,*B15--       ; store B3 to stack
        OR      1,A0,A0         ; set GIE bit field in CSR
        STW     B4,*B15--       ; store B4 to stack
        MVC     A0,CSR          ; write new CSR with GIE enabled
        STW     B0,*B15--       ; store B0 to stack (contains IRP)
        STW     B1,*B15--       ; store B1 to stack (contains IER)
        STW     A0,*B15--       ; store A0 to stack (original CSR)

* Beginning of ISR code

* End of ISR code

        LDW     *++B15,A0       ; restore A0 (original CSR)
        LDW     *++B15,B1       ; restore B1 (contains IER)
        LDW     *++B15,B0       ; restore B0 (contains IRP)
        LDW     *++B15,B4       ; restore B4
        LDW     *++B15,B3       ; restore B3
||      MVC     A0,CSR          ; restore original CSR
        LDW     *++B15,B2       ; restore B2
||      MVC     B0,IRP          ; restore original IRP
        LDW     *++B15,B1       ; restore B1
||      MVC     B1,IER          ; restore original IER
        LDW     *++B15,A0       ; restore A0
||      B       IRP             ; return from interrupt
        LDW     *++B15,B0       ; restore B0
        NOP     4               ; allow all multi-cycle instructions
                                ; to complete before branch is taken
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Memory Alias Disambiguation 


This appendix is a tutorial and practical treatment of the problem of memory
alias disambiguation on the ’C6x. If you write ’C6x linear assembly or
hand-coded assembly, you will gain direct practical knowledge and advice on how
to use the tools to handle this problem. If you write in C, you will gain insight
into how the compiler handles this problem, as well as some practical advice.


The keywords to keep in mind are: memory aliases, dependence graphs, and
instruction scheduling.
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A.1 Overview 
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Memory alias disambiguation analyzes whether a dependence, through a 
memory location, exists between a given pair of instructions. Dependences 
between instructions are then used to determine the fastest legal schedule of 
those instructions. 


This appendix begins by covering the topic of dependence. Next is a descrip- 
tion of how dependences are represented in dependence graphs. These con- 
cepts are then extended to cover loops. Then, it addresses how dependence 
affects instruction scheduling. Next, the term memory alias disambiguation is 
introduced. 


The focus then shifts to how the tools, particularly the assembly optimizer, han- 
dle memory alias disambiguation. However, if you write hand-coded assem- 
bly, you will find some useful concepts in these sections. Several detailed ex- 
amples are presented. 


Two final sections discuss how the C compiler handles memory alias disambi- 
guation, and the differences between memory alias disambiguation and 
memory bank conflict detection. 


Note that this appendix describes the ’C6x code generation tools for release 
2.10 or greater. 


A.2 Background 


A.2.1 Data Dependence Between Instructions 


One dictionary definition of dependence is “the state of being determined,
influenced, or controlled by something else”. In the world of software, the objects
being influenced can be modules of code, specific functions, blocks within
functions, individual statements, data structures, variables, etc. Further, the
relationship can be interdependent; that is, two objects can depend on
each other. This appendix refers to only one kind of dependence relationship:
the data dependence between individual assembly language instructions.


At this level, dependence is evaluated between pairs of instructions. Two in- 
structions have a dependence when they reference (read or write) the same 
machine resource, for example, register, memory location, status bit, and so 
forth. So, a dependence is characterized by the following pieces of informa- 
tion: 


- The first instruction
- The second instruction
- The resource both instructions reference
- The first instruction reference - read or write?
- The second instruction reference - read or write?


This information is summarized in the following table. The entries in the table 
are the formal name for that form of dependence. 


Table A-1. Dependence Table

                             Instruction 2 Reference
Instruction 1 Reference      read        write
read                         Input       Anti
write                        Flow        Output


Flow dependence is the most common and intuitive form of dependence. In
this relationship, one instruction writes an output which a following instruction
reads as an input. For example:


I1:     ADDK    10,A2           ; writes output to A2
I2:     STW     A2,*A3          ; reads input from A2


Instruction I1 writes an output in the register A2, and instruction I2 reads A2
as an input.
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Anti-dependence is less common than flow dependence, but it is no less im- 
portant. In this relationship, one instruction reads a resource as an input, and 
a following instruction writes a result to that same resource. For example: 


I3:     STW     A3,*A4          ; reads A4
I4:     ZERO    A4              ; writes A4


Instruction I3 reads A4 for a data address, while instruction I4 clears out A4
for some later computation.


Note a key difference from flow dependence: the anti-dependence exists
because of the reuse of the resource, and not because of a transfer of actual data.
So, one easy way to remove an anti-dependence is to choose a different
resource in the second instruction. In this example, instruction I4 could use the
register A5 instead:


I3:     STW     A3,*A4          ; reads A4
I4:     ZERO    A5              ; writes A5 ==> no anti-dependence


Since anti-dependence through a register is so easy to avoid, it is less com- 
mon. However, anti-dependence through a memory location is usually not as 
easy to rewrite. 


Output dependence is also not very common. One example is using a register 
to pass a value to a function. You will see a register load followed by a branch 
to a function which is known or presumed to overwrite that same register. 


I5:     LDW     *A8,A4          ; load A4
I6:     B       func            ; branch to func ==> overwrites A4


The relationship between these two instructions is an output dependence. 


Input dependence is common, but it is usually ignored. One exception is
accessing memory mapped peripherals. In that case, reading a memory location
can trigger a side effect such as incrementing a register, or starting a memory
block transfer. You generally want to recognize a dependence between any
instruction which triggers such side effects and any other memory reference.


The term independent is used to describe two instructions which do not refer- 
ence any of the same resources. Note the difference between the terms inde- 
pendent and anti-dependence. 
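
The four forms of dependence follow directly from whether each of the two instructions reads or writes the shared resource. A minimal C sketch of that classification, with hypothetical names, is:

```c
enum dep_kind { DEP_INPUT, DEP_ANTI, DEP_FLOW, DEP_OUTPUT };

/* first_writes / second_writes: nonzero if that instruction writes the
 * shared resource, zero if it only reads it. */
enum dep_kind classify_dependence(int first_writes, int second_writes)
{
    if (first_writes)
        return second_writes ? DEP_OUTPUT : DEP_FLOW;
    return second_writes ? DEP_ANTI : DEP_INPUT;
}
```

For example, a write followed by a read of A2 classifies as a flow dependence, and a read followed by a write of A4 classifies as an anti-dependence.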


A.2.2 Dependence Graphs 


Dependence graphs are used to represent the dependences between a set 
of instructions. 


From the C fragment: 


a = b + c;
d = e + f;


Here is hand modified compiler output which illustrates the serial instruction 
stream for those statements: 


;   5  a = b + c;

        LDW     *+DP(_b),B4     ; |5|
        LDW     *+DP(_c),B7     ; |5|
        ADD     B7,B4,B4        ; |5|
        STW     B4,*+DP(_a)     ; |5|

;   6  d = e + f;

        LDW     *+DP(_e),B5     ; |6|
        LDW     *+DP(_f),B6     ; |6|
        ADD     B6,B5,B4        ; |6|
        STW     B4,*+DP(_d)     ; |6|
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Here is the dependence graph: 


[Figure: dependence graph of the eight instructions above]


The circles are called nodes, and the arrows connecting the circles are called 
edges. There is one node for each instruction, and one edge for each depen- 
dence. Since instructions can have multiple dependences as both input and 
output, nodes in the graph can have multiple edges leading in and out. 


With regard to a single edge, the node at the head of the arrow is termed the 
“parent” of the “child” node at the tail of the arrow. 


The instruction is written immediately over the corresponding node. 


For loads and stores, both operands are written inside the node. For other in- 
structions, only the result operand is written, because the input operands are 
the result operands from the parent nodes. 


The numbers next to the edges indicate how many cycles of pipeline latency 
you must wait for that result to be available to the child node. A latency of 0 
indicates those instructions can be scheduled in parallel. 


A common misconception is to imagine data flowing along the edges. That is 
true for the common case of flow dependence (thus the name). But note the 
edge from the first STW to the second ADD. That is an anti-dependence on B4. 
No data is flowing in this dependence. The dependence is based on the reuse 
of register B4. Anti-dependence is shown in the graph with a boldface arrow. 
All of the other edges are flow dependences. In flow dependence, the associat- 
ed operand is always the last (or only) operand shown in the parent node. For 
anti-dependence, the associated operand is the first (or only) operand shown 
in the child node. 


In other literature, a node may be called a vertex (vertices for plural), and an 
edge may be called an arc. 


If you are accustomed to the dependence graphs that appear in Chapter 7 of
the ’C6x Programmer’s Guide, you will notice some differences. There, the graphs
are called dependency graphs, an edge is called a path, only one operand is
shown in load/store instructions, and anti-dependence is not addressed.


A.2.3 Data Dependence in Loops


So far, we have only looked at relationships between instructions in a simple 
straight-line block of code. Considering dependence between instructions in 
a loop requires some extensions to those concepts. 


A dependence graph for instructions in a loop looks the same, but there is a 
key difference. Each node, instead of representing one instruction, now repre- 
sents every instance of that instruction in every loop iteration. The same is true 
of the edges. 


When considering dependence graphs in straight-line code, you do not have 
to worry about the direction of the dependence, because it is always the same: 
from an earlier serial instruction to a later one. In loops, however, an instruction 
late in the loop can generate a result which is used, in the next loop iteration, 
by an instruction earlier in the loop. We say such dependences are carried by 
the loop. In that case, the edge in the dependence graph goes the other direc- 
tion. Here is a linear assembly code fragment: 
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loop: 
        LDH     *xptr++,xi
        MPY     c1,xi,p0
        MPY     c2,yi,p1        ; reads yi from prior iteration
        ADD     p0,p1,sum
        SHR     sum,15,yi       ; writes yi for next iteration
        STH     yi,*yptr++
        ; decrement and branch to loop


Here is the dependence graph:


[Figure: dependence graph of the loop]


Consider the instructions which reference yi. Note the flow dependence,
carried to the next iteration by the loop, on yi from the SHR to MPY, because the
SHR writes a value to yi which MPY reads. Also note the anti-dependence, not
carried by the loop, on yi from the MPY to the SHR, because the MPY must read
yi before SHR writes to it. Note how the two dependences are in opposite
directions.
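
The same loop-carried dependence can be seen in a C rendering of the loop. This is a sketch for illustration only: the function names, the trip count parameter n, and the initial value y_init are assumptions, since the linear assembly fragment does not show them.

```c
/* C sketch of the loop: y[i] = (c1*x[i] + c2*y[i-1]) >> 15.
 * y_init stands in for the value of yi before the first iteration. */
void iir1(const short *x, short *y, int n, short c1, short c2, short y_init)
{
    short yi = y_init;
    int i;
    for (i = 0; i < n; i++) {
        int p0 = c1 * x[i];
        int p1 = c2 * yi;               /* reads yi from the prior iteration */
        yi = (short)((p0 + p1) >> 15);  /* writes yi for the next iteration  */
        y[i] = yi;
    }
}

/* Runs four iterations with c1 = c2 = 0.5 in Q15 and returns y[3]. */
int iir1_demo(void)
{
    short x[4] = { 1000, 1000, 1000, 1000 };
    short y[4];
    iir1(x, y, 4, 16384, 16384, 0);
    return y[3];
}
```

Each iteration cannot compute p1 until the previous iteration has produced yi, which is exactly the flow dependence carried by the loop.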


Nodes which are not loads and have no parents (c1, c2, 15) are either invariant 
in the loop or constants. No latency is shown by the edge since the operand 
is always available. 


This appendix only examines simple loops which contain no other loops. Data
dependence in the presence of nested loops is beyond the scope of this
chapter. With regard to the differences introduced by nested loops, the ’C6x
assembly optimizer capability and features, as well as this appendix, work together
to provide you with a conservatively safe solution. That is, the solutions we
provide are generally optimal for simple loops, and safe, though sometimes less
than optimal, for nested loops. The C compiler, on the other hand, performs
very sophisticated dependence analysis on nested loops.


A.2.4 How Dependence Affects Instruction Scheduling 


Instruction scheduling is the problem of choosing a schedule for a serial
stream of instructions which satisfies two competing constraints: it preserves
the computational effect of the serial instructions (that is, the code still
works), and it has the best performance.


Instruction scheduling algorithms are built around one central concept: while 
you do not have to honor the serial instruction order, you do have to honor the 
order imposed by the instruction dependences. 


We can examine this concept at the C statement level. Take the C fragment 
presented earlier and simply swap the order of the statements: 


d = e + f;
a = b + c;


It is obvious this will generate the same answer. Why? Because the two state- 
ments are independent; they do not reference any of the same variables. Con- 
sider this fragment ... 

x = y + z; /* #1 */
z = x + 1; /* #2 */


Obviously, if you reorder these statements, you will get a different answer. 
Consider the variable x. Statement 1 writes a value to x which statement 2 
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reads; a flow dependence on x. Consider the variable z. Statement 1 reads 
a value from z while statement 2 writes a value to z; an anti-dependence on 
z. Either dependence alone prevents reordering the statements. 


It may be somewhat surprising that the forms of dependence, flow or anti- or 
output, all have the same effect on the statement order. In every case, the de- 
pendence constrains those statements to be in that order. 


Input dependence is ignored with regard to scheduling; the order you read 
from memory usually does not matter. Instances where you may be concerned 
about input dependence include considering the effect on cache behavior, or 
accessing memory mapped peripherals. 


These same ideas transfer directly to assembly language instructions. Instruc- 
tions which are independent can be reordered, instructions which have one or 
more dependences cannot be reordered. Further, the latencies associated 
with the dependences must be honored. 
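
The reordering rule can be sketched in C by describing each statement or instruction with the set of resources it reads and the set it writes. The structure, the function names, and the bit assignments are hypothetical, and latencies are not modeled.

```c
/* Each bit stands for one resource (a variable or register). */
struct insn {
    unsigned reads;     /* resources read    */
    unsigned writes;    /* resources written */
};

/* Returns nonzero if the two instructions may be reordered, that is,
 * if there is no flow, anti-, or output dependence between them.
 * Input dependence (read after read) is ignored. */
int may_reorder(struct insn first, struct insn second)
{
    unsigned flow   = first.writes & second.reads;
    unsigned anti   = first.reads  & second.writes;
    unsigned output = first.writes & second.writes;
    return (flow | anti | output) == 0;
}

/* Checks the two C fragments discussed above. */
int reorder_demo(void)
{
    enum { A = 1, B = 2, C = 4, D = 8, E = 16, F = 32,
           X = 64, Y = 128, Z = 256 };
    struct insn s_a = { B | C, A };     /* a = b + c; */
    struct insn s_d = { E | F, D };     /* d = e + f; */
    struct insn s_1 = { Y | Z, X };     /* x = y + z; */
    struct insn s_2 = { X, Z };         /* z = x + 1; */
    return may_reorder(s_a, s_d) && !may_reorder(s_1, s_2);
}
```

The first pair shares no resources and can be swapped; the second pair has both a flow dependence on x and an anti-dependence on z, so it cannot.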


Since dependences force instruction orderings, it follows that fewer depen- 
dences mean fewer constraints on instruction orderings. Put another way, few- 
er dependences mean more degrees of freedom in choosing an instruction 
schedule. On a chip architecture like the ’C6x, which has many opportunities
for parallelism in combination with a deep pipeline, you can never have too 
much freedom in choosing an instruction schedule. 


The details of how instruction scheduling algorithms really work are also beyond
the scope of this appendix. But here is the compiler generated schedule for the
original C fragment presented earlier:

        LDW     .D2T2   *+DP(_b),B4     ; |5|
        LDW     .D2T2   *+DP(_c),B7     ; |5|
        LDW     .D2T2   *+DP(_f),B6     ; |6|
        LDW     .D2T2   *+DP(_e),B5     ; |6|
        NOP             2
        ADD     .L2     B7,B4,B4        ; |5|
        STW     .D2T2   B4,*+DP(_a)     ; |5|
        ADD     .L2     B6,B5,B4        ; |6|
        STW     .D2T2   B4,*+DP(_d)     ; |6|


Note how the instructions from the two C statements are interspersed. The 
load statements are scheduled early, to better hide the latency of a load. The 
rest of the instructions are scheduled as soon as the latencies of the instruc- 
tions they depend on are satisfied. Use the dependence graph from the earlier 
section as a guide. 


A.2.5 Memory Alias Disambiguation Defined 


This concept has an analogue in computer programs. When there are two (or 
more) different ways to reference a memory location, we say there are aliases 
to that memory location. 


Given this linear assembly fragment: 


I7:     LDW     *A4,A2
        ; other instructions
I8:     STW     A3,*A6


Do A4 and A6 reference the same memory location or not? If they do, they are
memory aliases to that memory location. If they are memory aliases, then
these two instructions have an anti-dependence between them; the read must
occur before the write. In the instruction schedule, this dependence means
those instructions must remain in that order.


On the other hand, if A4 and A6 do not reference the same memory location,
they are not memory aliases. The instructions are independent. In the
instruction schedule, these instructions can safely be placed in any order.


Note that unlike an anti-dependence on registers, there is no way to rewrite 
these instructions to remove the anti-dependence. 


How can you determine whether *A4 is an alias for *A6 or not? Given the
information we have here, you cannot. Thus, we call this an ambiguous alias.
Maybe it is an alias, maybe it is not.


Memory alias disambiguation, then, is the process of determining whether any 
given pair of memory references are aliases to one another. The outcome of 
that process is one of three states: 


State              Means

Not aliases        Instructions are independent.
Are aliases        Instructions have a dependence between them.
Still ambiguous    Tool convention or user information is needed.


If a dependence is found, it imposes an ordering on the instruction schedule. 
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A.3 Tools Solution 


A.3.1 Overview of the Assembly Optimizer Solution 


First, the assembly optimizer attempts to automatically disambiguate as many 
memory references as possible. For all remaining memory references, the de- 
fault is to presume they may access the same memory location, i.e. they are 
aliases. While that presumption is safe for all input, it is usually too conserva- 
tive. So, a command line switch (-mt) and a function level directive 
(.no_mdep) can reverse that presumption, i.e. presume that any ambiguous 
aliases do not access the same memory location. If you have a small number 
of possible aliases in your code, you can use an additional directive (.mdep) 
to mark those instruction pairs. 


A.3.2 Automatic Disambiguation 


So, what can the assembly optimizer disambiguate automatically? 


When you have memory references that use the same base register but differ- 
ent constant offsets, with no intervening modification of that base register, the 
assembly optimizer will recognize those references cannot possibly access 
the same memory location. For example ... 


I9:     LDW     *AR2[4],A3      ; not an alias to I10
        ; no changes to AR2
I10:    LDW     *AR2[5],A4      ; not an alias to I9


Note the trivial case of 0 offset, e.g. *AR2, is included in this analysis. 
Any other combination of memory references is an ambiguous alias. 


User information about aliases, whether in the form of command line options 
or directives, is used only after automatic disambiguation fails to yield an ans- 
wer. You cannot override any automatic disambiguation, whether the answer 
is “not aliases” or “are aliases”. Implications of this policy are detailed later. 
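
The automatic rule can be sketched in C as follows. This is a hypothetical model, not the assembly optimizer's implementation: the type and function names are invented for illustration, and treating "same base, same offset, base unmodified" as "are aliases" is an assumption, since the text only spells out the "not aliases" case.

```c
enum alias_state { NOT_ALIASES, ARE_ALIASES, STILL_AMBIGUOUS };

struct mem_ref {
    int base_reg;   /* number of the register used as the base          */
    int offset;     /* constant offset; 0 for a plain *reg reference    */
};

/* base_modified_between: nonzero if the base register is changed
 * between the two references. */
enum alias_state auto_disambiguate(struct mem_ref r1, struct mem_ref r2,
                                   int base_modified_between)
{
    if (r1.base_reg != r2.base_reg || base_modified_between)
        return STILL_AMBIGUOUS;     /* left to the default or to the user */
    return (r1.offset != r2.offset) ? NOT_ALIASES : ARE_ALIASES;
}

/* Checks the two-load example: same base, offsets 4 and 5, base unchanged. */
int auto_disambiguate_demo(void)
{
    struct mem_ref first  = { 2, 4 };
    struct mem_ref second = { 2, 5 };
    struct mem_ref other  = { 3, 4 };
    return auto_disambiguate(first, second, 0) == NOT_ALIASES
        && auto_disambiguate(first, second, 1) == STILL_AMBIGUOUS
        && auto_disambiguate(first, other, 0)  == STILL_AMBIGUOUS;
}
```

Only the STILL_AMBIGUOUS outcome is influenced by command line options or directives; the other two outcomes are final.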


A.3.3 Default Presumption is Pessimistic 


The default presumption, any ambiguous alias must be an alias, is a worst 
case, or pessimistic, assumption. While it is common to have instructions in 
your linear assembly which possibly access the same memory location, it is
relatively rare for that possibility to come true. Still, this pessimistic assumption is
key to giving you the ability to balance correct handling of memory aliases with 
good performance. 


The pessimistic assumption can have a drastic effect on software pipelining. 
Many linear assembly loops fit this general form: 
loop:
I11:    LDW     *p1++,inp1
        ; compute something into outp2
I12:    STW     outp2,*p2++
        ; decrement and branch to loop


Under the default assumption, p1 and p2 may reference the same memory
location. This means two more dependence edges are added to the
dependence graph: an anti-dependence edge between I11 and I12, and a flow
dependence edge between I12 and I11.


[Figure: dependence graph with the added memory dependence edges between the LDW and the STW]


When a dependence edge is associated with a memory reference, and not a 
register operand, you will see a triangle imposed over the edge. 


These dependences mean those instructions must remain in that order for ev- 
ery loop iteration. Because these are the first and (nearly the) last instructions 
in the loop, and they cannot be moved past each other, software pipelining is 
all but completely disabled. 


However, in many cases, p1 and p2 point to completely different arrays, and 
thus never reference the same memory location. So, what is a user to do? 
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A.3.4 Change the Default Presumption to Optimistic 


There are two methods for changing the presumption to the optimistic view 
that ambiguous memory aliases never access the same location. You can use 
a command line option: 


cl6x -mt


Or, you can use a function level directive:


.no_mdep


The command line option affects every function in your linear assembly file. 
The .no_mdep directive can only appear within the .(c)proc/.endproc
block of a linear assembly function, and affects only that function. 


If you are certain you have no memory aliases in your code, then switching to 
the optimistic assumption is all you need to do. This will be a common case. 
If you ever do have a memory alias in your code, now you know how to handle 
it: get this appendix out again. 


Many users will want to switch to the optimistic assumption, except for a small
number of aliases they know about in their code. If that is you, the solution is to
switch to the optimistic assumption, and then use the .mdep directive to mark
those few aliases you have.


A.3.5 Using .mdep to Mark Aliases 


Marking an instance of a memory alias is a two step process. First you attach 
symbolic names to your memory references in the linear assembly stream: 


        LDW     *p1++{ld1},inp1     ; name memory reference "ld1"
        STW     outp2,*p2++{st1}    ; name memory reference "st1"


The names in the “{}” are assembly symbols like any other. You cannot use the 
same symbol as a memory reference name and a symbolic register. Then you 
note the specific memory dependence: 


        .mdep   ld1,st1


This means whenever ld1 references some memory location X, at some later
time in code execution, st1 may also reference location X. This is equivalent
to adding an edge between these two instructions in the dependence graph.
In terms of the instruction schedule, these instructions must remain in that
order. The ld1 reference must always occur before the st1 reference.


Recall how the direction of a given dependence is important when considering
loops. The direction implied by .mdep is from the first named memory
reference to the second; in this case, from ld1 to st1. The opposite direction, from
st1 to ld1, is not implied.


A.4 Examples of Memory Alias Disambiguation 


A.4.1 How .mdep Affects Instruction Scheduling


The following are some complete examples. This example illustrates how an
.mdep may, or may not, affect the instruction schedule. It also shows how the
direction of a dependence, as indicated by the order of the operands to the
.mdep directive, affects the instruction schedule.


Full understanding of all the examples presumes an understanding of the
general concepts of software pipelining. For background information on software
pipelining, see Chapter 7.


This linear assembly function is adapted from the weighted vector sum
example. A typical call to this function could look like:

        .call   wvs(a, b, c, m)

Here is the source: 

wvs:    .cproc  aptr, bptr, cptr, m
        .reg    cntr, ai, bi, pi, pi_scaled, ci
        MVK     100,cntr
        .no_mdep                        ; presume no memory aliases

loop:   .trip   100

        LDH     *aptr++,ai
        LDH     *bptr++,bi
        MPY     m,ai,pi
        SHR     pi,15,pi_scaled
        ADD     pi_scaled,bi,ci
        STH     ci,*cptr++
[cntr]  SUB     cntr,1,cntr
[cntr]  B       loop

        .endproc
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Here is the dependence graph (without the decrement and branch instruc- 
tions): 


[Figure: dependence graph of the loop body]


The assembly optimizer generates a 2-cycle loop for this code, which is opti- 
mal for this input. 
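
For readers more comfortable with C, the computation performed by the linear assembly above can be sketched as follows. This is only an approximate rendering under stated assumptions: the trip count is passed as a parameter n rather than fixed at 100, and the demo function and its test values are hypothetical.

```c
/* C sketch of the weighted vector sum: c[i] = ((m * a[i]) >> 15) + b[i]. */
void wvs(const short *a, const short *b, short *c, short m, int n)
{
    int i;
    for (i = 0; i < n; i++)
        c[i] = (short)(((m * a[i]) >> 15) + b[i]);
}

/* Returns 1 if a separate output array and an in-place call (the output
 * array is the same as the b input array) give the same last element. */
int wvs_demo(void)
{
    short a[100], b[100], c[100];
    int i;
    for (i = 0; i < 100; i++) {
        a[i] = (short)(2 * i);
        b[i] = (short)i;
    }
    wvs(a, b, c, 16384, 100);       /* m = 0.5 in Q15 */
    wvs(a, b, b, 16384, 100);       /* in place: c aliases b */
    return c[99] == 198 && b[99] == 198;
}
```

In the C version each element of b is read before the corresponding element of c is written, so passing the same array for both gives the same result as using separate arrays.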


Suppose you know some calls to wvs pass the same array as the b input array 
and the c output array, just to save some space: 


        .call   wvs(a, b, b, m)


So, every loop iteration reads an element from the b array, and then immediately
writes a result back to that same element. The correct way to model that
is:


wvs:    .cproc  aptr, bptr, cptr, m
        .reg    cntr, ai, bi, pi, pi_scaled, ci
        MVK     100,cntr
        .no_mdep                        ; presume no memory aliases
        .mdep   b_load,c_store          ; except this one

loop:   .trip   100

        LDH     *aptr++ {a_load},ai
        LDH     *bptr++ {b_load},bi
        MPY     m,ai,pi
        SHR     pi,15,pi_scaled
        ADD     pi_scaled,bi,ci
        STH     ci,*cptr++ {c_store}

 [cntr] SUB     cntr,1,cntr
 [cntr] B       loop

        .endproc


Here is the dependence graph:
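[Figure: dependence graph for the annotated loop, with the memory references
labeled {a_load}, {b_load}, and {c_store}, and an anti-dependence memory edge
from {b_load} to {c_store}.]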


Note the addition of the anti-dependence memory edge between the b_load 
and the c_store. The assembly optimizer still generates a 2-cycle loop for 
this code; the addition of the .mdep makes no difference. Why? 


Well, there is already a chain of flow dependences, through registers, from
b_load to c_store, and that chain of dependences imposes an ordering on
the instructions in the chain. So, the instruction ordering constraint imposed
by the anti-dependence edge is no different than the constraints already
imposed by the flow dependence chain. Therefore, the instruction schedule
doesn’t change.


Or, you can think about it strictly in terms of the new anti-dependence memory
edge. It means every time b_load references memory location X, at some later
time in execution, c_store may also reference location X. This means that
each b_load must occur before each c_store. More importantly, it also
means each c_store does not have to occur before each b_load. So, the
b_load for the next loop iteration can start before the c_store from the
previous iteration finishes. Here is an illustration of the software pipeline where
each iteration is in a separate column, and instructions which can run in parallel
are on the same horizontal line:


Iteration n         Iteration n+1

LDH b_load
                    LDH b_load
STH c_store
                    STH c_store


Well, the software pipeline was structured like that before the .mdep. So, no 
change. 


While it makes for a contrived example, consider what happens if you call wvs 
like this: 


        ADD     b,2,c                   ; c points to b[1]
        .call   wvs(a, b, c, m)


So, c_store writes its result to an array element which b_load reads on the
next loop iteration. Here is the correct way to model that:


        .no_mdep                        ; presume no memory aliases
        .mdep   c_store,b_load          ; except this one

        <exactly as before>


Note the .mdep is the same as in the previous example, except the operands are
reversed.


Here is the dependence graph:
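[Figure: dependence graph for the annotated loop, with the memory references
labeled {a_load}, {b_load}, and {c_store}, and a loop-carried flow dependence
memory edge from {c_store} to {b_load}.]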


Now, instead of the anti-dependence memory edge, there is a flow dependence
memory edge from c_store to b_load. Note this dependence is carried by the
loop. Now the assembly optimizer generates a 7-cycle loop. Why?


Recall the chain of flow dependences, on registers, from the b_load to the
c_store. Now that chain is extended, and carried by the loop, to the b_load
for the next iteration. Before you can start that b_load for the next loop
iteration, you have to wait for the c_store from the present iteration to
finish. Here is how the software pipeline looks:


Iteration n         Iteration n+1

LDH b_load
STH c_store
                    LDH b_load
                    STH c_store
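A loop-carried dependence cycle puts a floor under the loop length: the sum of the latencies around the cycle, divided by the number of iterations the cycle spans. The C sketch below computes that bound. The struct, the function name, and in particular the latency values mentioned in the note afterwards are assumptions made for illustration; they are not taken from this appendix.

```c
#include <stddef.h>

/* One step on a loop-carried dependence cycle. */
struct dep_step {
    const char *name;   /* instruction producing the value             */
    int latency;        /* cycles before the dependent one may issue   */
};

/* Lower bound on the iteration interval imposed by one dependence cycle:
 * total latency around the cycle divided by the number of iterations the
 * cycle spans (its distance), rounded up. */
int recurrence_bound(const struct dep_step *cycle, size_t n, int distance)
{
    int total = 0;
    for (size_t i = 0; i < n; i++)
        total += cycle[i].latency;
    return (total + distance - 1) / distance;
}
```

Assuming a latency of 5 cycles for the b_load, 1 for the ADD, and 1 from the c_store to the dependent load, with a distance of one iteration, the bound is 7, which is consistent with the 7-cycle loop reported for this example.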


So, as you can see, the direction, as implied by the operand order, is a very 
important characteristic of an .mdep. 
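The two aliasing cases can be tried out with a small C model. The sketch below is an illustration only: wvs_overlapped imitates a schedule in which the b load for the next iteration is issued before the store for the current one (a one-iteration overlap, simpler than a real software pipeline), and both function names are hypothetical.

```c
#include <stddef.h>

/* Reference order: each iteration loads, computes, and stores before the
 * next one begins. b is not const-qualified because callers may alias it
 * with c. */
void wvs_sequential(const short *a, short *b, short *c, short m, size_t n)
{
    for (size_t i = 0; i < n; i++)
        c[i] = (short)((((int)m * a[i]) >> 15) + b[i]);
}

/* Overlapped order: the b_load for iteration i+1 is issued before the
 * c_store for iteration i. */
void wvs_overlapped(const short *a, short *b, short *c, short m, size_t n)
{
    if (n == 0)
        return;
    short bi = b[0];                        /* b_load, iteration 0     */
    for (size_t i = 0; i < n; i++) {
        short ci = (short)((((int)m * a[i]) >> 15) + bi);
        if (i + 1 < n)
            bi = b[i + 1];                  /* b_load, iteration i+1   */
        c[i] = ci;                          /* c_store, iteration i    */
    }
}
```

When c and b are the same array, both orders give identical results, so the overlap is safe. When c is b + 1, the overlapped order reads each element before the previous iteration has written it and the results differ from the sequential order. That difference is what the .mdep from c_store to b_load prevents.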


A.4.2 Handling Indexed Addressing Mode


Indexed addressing, e.g. *+A4[A5], typically means you are accessing 
memory without any clear pattern. How should you handle this case? 


Here is an example ...

histogram: .cproc inptr, hptr, len
        .reg    idx, count
        .no_mdep                        ; no memory aliases
loop:   .trip   8
        LDHU    *inptr++,idx
        LDW     *+hptr[idx],count
        ADD     count,1,count
        STW     count,*+hptr[idx]
 [len]  SUB     len,1,len
 [len]  B       loop
        .endproc


Here is the dependence graph:


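[Figure: dependence graph. LDHU feeds idx to LDW *+hptr[idx], which feeds
count to ADD; ADD feeds count’ to STW *+hptr[idx’], and an MV instruction
produces idx’ from idx.]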


Note the assembly optimizer splits the lifetime of the idx register by adding
the MV instruction; the new variable is shown as idx’. The lifetime of count
is similarly split at the ADD instruction.


The assembly optimizer generates a 2-cycle loop, but it will not work. Why?
This loop is computing, into the array hptr, a histogram of all the values in the
array inptr. What if the value 10 occurs in the inptr array two times in a
row? In that case, the location *hptr[10] is incremented on successive loop
iterations. Look at the dependence graph. Do you see a dependence edge
from the STW to the LDW? No? Well, that is the problem. The LDW for the next
loop iteration has permission to get started without waiting for the STW from the
previous loop iteration, which it does. To fix this situation we add the .mdep:


histogram: .cproc inptr, hptr, len
        .reg    idx, count
        .no_mdep                        ; no memory aliases
        .mdep   h_st, h_ld              ; except this one
loop:   .trip   8
        LDHU    *inptr++,idx
        LDW     *+hptr[idx] {h_ld},count
        ADD     count,1,count
        STW     count,*+hptr[idx] {h_st}
 [len]  SUB     len,1,len
 [len]  B       loop
        .endproc


Here is the dependence graph: 
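[Figure: dependence graph for the annotated histogram loop, with a loop-carried
flow dependence memory edge from STW *+hptr[idx’] {h_st} to
LDW *+hptr[idx] {h_ld}.]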


Note the flow dependence memory edge, carried by the loop, from the STW to
the LDW. Now the assembly optimizer generates a 7-cycle loop. Much slower,
but it works.


Do you need the .mdep in the other direction, from the h_ld to the h_st? If
you simply want the code to run, and you do not care why, then the answer is
no. Because of the chain of flow dependences on registers from h_ld to h_st,
this ...


        .no_mdep                        ; no memory aliases
        .mdep   h_st, h_ld              ; except this one
        .mdep   h_ld, h_st              ; and this one


does not change the instruction schedule. But the dependence actually does
exist, so it is advisable to add this .mdep because it makes the code
self-documenting.


In the face of indexed addressing, you may be tempted to just rely on the
default pessimistic assumption. Be careful. In this case, that will hand you a
13-cycle loop. Why? Because under the pessimistic assumption a dependence
is recognized from the STW (at the bottom of the loop) to the LDHU (at
the top of the loop). That means the load at the top of iteration n+1 cannot start
until the store at the end of iteration n is finished.
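The failure of the histogram loop without the .mdep from h_st to h_ld can be reproduced with a small C model. This is a sketch under simplifying assumptions: histogram_overlapped lets the count load run only one iteration ahead of the store, whereas a real software pipeline may overlap more iterations, and both function names are hypothetical.

```c
#include <stddef.h>

/* Reference order: the store for each sample completes before the load
 * for the next sample. */
void histogram_sequential(const unsigned short *in, unsigned int *h, size_t len)
{
    for (size_t i = 0; i < len; i++)
        h[in[i]] += 1;
}

/* Overlapped order: the count for sample i+1 is loaded before the updated
 * count for sample i is stored, as the schedule without .mdep permits. */
void histogram_overlapped(const unsigned short *in, unsigned int *h, size_t len)
{
    if (len == 0)
        return;
    unsigned int count = h[in[0]];          /* LDW, iteration 0        */
    for (size_t i = 0; i < len; i++) {
        unsigned int updated = count + 1;   /* ADD                     */
        if (i + 1 < len)
            count = h[in[i + 1]];           /* LDW, iteration i+1      */
        h[in[i]] = updated;                 /* STW, iteration i        */
    }
}
```

For an input that contains the value 10 twice in a row, the sequential version ends with h[10] equal to 2, while the overlapped version ends with h[10] equal to 1: the second load read the bin before the first store had updated it, so one increment is lost.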


A.4.3 Peripherals Access Example 


Recall that you cannot override any automatic disambiguation performed by
the assembly optimizer. If it can determine that two memory references (must/
must not) access the same memory location it (will/will not) recognize a
dependence between the associated instructions. This is true despite any command
line options or .mdep directives which may be in effect. This means there is
no way to guarantee the assembly optimizer will use a particular pattern of
access to memory. In general code, this is preferable behavior. But it can be an
issue when you consider code which accesses memory-mapped peripherals.
Here is an example:


; base of multi-channel buffered serial port 0
MCBSP0_BASE .set    0x018C0000

mcbsp0_dxr: .cproc
        MVK     MCBSP0_BASE,B4          ; load base of McBSP0 regs
        MVKH    MCBSP0_BASE,B4
        STW     B5,*+B4(0x10)           ; init XCR for transfer
        STW     B6,*+B4(0x4)            ; transfer word through DXR
        .endproc


It is clear that the two STW memory references are not accessing the same
memory location; they are using the same base register with different offsets.
So, even if you use:


        .mdep   wrt_xcr,wrt_dxr

        STW     B5,*+B4(0x10) {wrt_xcr}
        STW     B6,*+B4(0x4) {wrt_dxr}


the assembly optimizer may still reorder the writes. In general code, this is fine, 
and often an improvement. But when accessing peripherals like a serial port, 
or whenever writing to a memory location can trigger a side effect, reordering 
the memory references is wrong. 
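In C, the volatile qualifier is the standard way to mark locations whose accesses have side effects: the compiler must perform each volatile access, and in the order written. The sketch below shows the same two register writes in that style. Only the two offsets (0x10 for XCR and 0x4 for DXR) come from the listing above; the function name, the parameter names, and the idea of passing the base address as a pointer are assumptions made so that the code can also be tried on a host against an ordinary array.

```c
#include <stdint.h>

/* Byte offsets of the two registers used in the listing above. */
#define DXR_OFFSET 0x04u
#define XCR_OFFSET 0x10u

/* Writes XCR, then DXR. Because the accesses go through a
 * volatile-qualified pointer, the C compiler may not drop or reorder them. */
void mcbsp_send(volatile uint32_t *base, uint32_t xcr_value, uint32_t word)
{
    base[XCR_OFFSET / 4] = xcr_value;   /* init XCR for transfer        */
    base[DXR_OFFSET / 4] = word;        /* transfer word through DXR    */
}
```

On the target, base would be the peripheral address cast to a volatile pointer, for example (volatile uint32_t *)0x018C0000.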


Presently, there are two ways to solve this problem. You can write the code in 
C, being careful to use the keyword “volatile” for any memory reference which 
has a side effect. Or, you can bypass the assembly optimizer by writing these 
routines in hand-coded assembly.


A.5 C Compiler and Alias Disambiguation


The C compiler provides several methods, both command line options and in
the source, for addressing the problem of memory alias disambiguation. Having
read this appendix, you should now have a much better understanding of
the issue. This section will briefly review each of these methods. For all the
details, consult either this book or the TMS320C6x Optimizing C Compiler User’s
Guide.


The compiler does a much better job of alias disambiguation than the assembly
optimizer. C source provides much more information to work with. So, the
default presumption on aliases which cannot be disambiguated is the
pessimistic one: they are aliases.


Still, there are a few very esoteric cases of memory aliases which the compiler 
presumes do not occur. If your code violates those presumptions use: 


cl6x -ma 


On rare occasions, you may need it. 


The command line option: 
cl6x -mt


means something different to the compiler than it does to the assembly
optimizer. As presented already, this option reverses the assembly optimizer’s
pessimistic assumption that memory references it cannot disambiguate must
be aliases. To the compiler, this same option means several specific instances
of memory aliases do not occur in your C code.


The command line options:
cl6x -pm -o3


have several effects, of which improved alias disambiguation is only one. These
options work together to provide program level optimization. The -pm option
combines all of your source files into one intermediate file, and -o3 carries out
the program level optimization on that intermediate file. Seeing all the functions
at once yields optimization opportunities which generally do not occur
otherwise. If the compiler can see all the calls to this function:


void foo(int *p1, int *p2)


it can easily determine that the same array is never passed in for p1 and p2, 
and therefore p1 and p2 references are not aliases. 
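To see why it matters whether p1 and p2 can alias, consider a hypothetical body for foo. The text gives only the prototype; the two statements below are an assumption chosen to make the effect visible.

```c
/* Adds *p2 to *p1 twice. If p1 and p2 may point at the same object, *p2
 * has to be read again after the first store, because that store may have
 * changed it. If the compiler knows they never alias, one read of *p2 is
 * enough. */
void foo(int *p1, int *p2)
{
    *p1 += *p2;
    *p1 += *p2;
}
```

Called with two distinct objects holding 1 and 2, the first ends up as 5. Called with the same object (holding 1) for both arguments, it ends up as 4, so the result genuinely depends on aliasing. Knowing that no call passes the same object twice lets the compiler keep *p2 in a register across both statements.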


Aggressive, but correct, use of the const keyword helps the alias
disambiguation problem. The const keyword tells the compiler that a data object
will not be modified. Not even by an alias. So, any const qualified memory read
cannot possibly be an alias to a memory write. If an alias does modify a const
object, that is a user bug.
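As an illustration, here is a vector sum written with const-qualified inputs. The function name and signature are assumptions made for this sketch.

```c
#include <stddef.h>

/* in1 and in2 are only read; sum is only written. Under the rule described
 * above, the const-qualified reads cannot be aliases of the writes through
 * sum. */
void vecsum(short *sum, const short *in1, const short *in2, size_t n)
{
    for (size_t i = 0; i < n; i++)
        sum[i] = (short)(in1[i] + in2[i]);
}
```

Note that in ISO C, const on a pointer parameter only forbids writes through that particular pointer. The stronger reading (not even by an alias) is the one this section attributes to the compiler, so passing the same array as both sum and in1 would be the user bug the paragraph describes.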




A.6 Memory Alias Disambiguation versus Memory Bank Conflict Detection 


If a ’C6x execute packet (a set of instructions which execute in parallel)
includes two references to memory, and both of those references are to the
same memory bank, because each bank is single-ported memory, the pipeline
stalls for one cycle while the second memory word is accessed. This is called
a bank conflict, and it is obviously worth avoiding. The assembly optimizer
provides a directive called .mptr for the purpose of solving this problem. See the
TMS320C6x Optimizing C Compiler User’s Guide for all the details.


It is easy to confuse the topic of memory alias disambiguation with memory
bank conflict detection. The terms sound similar. And they are both concerned
with how memory references affect the instruction schedule. But there are
some striking differences ...


                        Alias Disambiguation            Bank Conflict Detection

The problem is ...      Whether two memory references   Whether two memory references
                        access exactly the same         access the same memory bank
                        location

The answer affects ...  The ordering constraints        Whether to schedule a pair of
                        imposed on the instruction      memory references in parallel
                        schedule

Get it wrong and ...    Your code does not work         The execute packet takes 1
                                                        cycle longer

Occurrences of ...      Memory aliases are relatively   Potential memory bank
                        rare                            conflicts are common


You have to solve the problem of memory alias disambiguation before you can
consider the problem of memory bank conflict detection. One of the
presumptions of memory bank conflict detection is that the two memory references
can be scheduled in parallel. That is true only if you have already determined the
instructions are independent; they are not aliases to one another.
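The order of the two questions can be sketched in C. The memory geometry used here (four interleaved banks, each two bytes wide) is an assumption chosen for illustration and is not taken from this appendix; the sketch also treats each reference as a single byte address and ignores access width.

```c
#include <stdint.h>

/* Assumed, illustrative memory geometry. */
#define BANK_WIDTH_BYTES 2u
#define NUM_BANKS        4u

enum pair_kind { PAIR_ALIAS, PAIR_BANK_CONFLICT, PAIR_PARALLEL_OK };

/* Bank index of an address under the assumed interleaved scheme. */
static unsigned bank_of(uint32_t addr)
{
    return (addr / BANK_WIDTH_BYTES) % NUM_BANKS;
}

/* Asks the alias question first; the bank question is only meaningful for
 * references already known to be independent. */
enum pair_kind classify_pair(uint32_t addr1, uint32_t addr2)
{
    if (addr1 == addr2)
        return PAIR_ALIAS;              /* ordering constraint          */
    if (bank_of(addr1) == bank_of(addr2))
        return PAIR_BANK_CONFLICT;      /* legal in parallel, may stall */
    return PAIR_PARALLEL_OK;
}
```

Under these assumptions, addresses 0x100 and 0x108 are independent but fall in the same bank, while 0x100 and 0x102 fall in different banks.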


In your linear assembly, it is best to simply keep these problems, and their
solutions, entirely separate. Enter your .mdep directives without any regard to
your .mptr directives, and vice versa.


A.7 Summary


- Dependence is a relationship between two instructions which read or write
  the same machine resource.

- Dependence graphs represent the dependence between instructions.
  Nodes (circles) are instructions. Edges (arrows) are dependences. Data
  often flows along edges, but not always.

- In loops, nodes represent every instance of that instruction in every loop
  iteration, and dependence direction is important.

- Dependences force an ordering on the instruction schedule.

- Generally, the fewer the dependences, the better the schedule.

- Multiple references to the same memory location are called aliases.

- Aliases imply a dependence between the associated instructions.

- Memory alias disambiguation is the process of determining whether a pair
  of references are aliases, that is, whether a dependence is recognized
  between the instructions.

- The assembly optimizer automatically disambiguates some references,
  then uses command line options and directives to disambiguate the
  remaining references.

- The default presumption for remaining aliases is pessimistic; they are
  aliases.

- The assembly optimizer command line option:

        cl6x -mt

  reverses the default presumption to optimistic; they are not aliases. It
  applies to all functions in the file.

- The function level directive ...

        .no_mdep

  also changes the presumption to optimistic, but applies only to the function
  which contains it.

- To mark a specific memory dependence, first annotate memory references ...

        LDW     *p1++{ld1},inp1         ; name memory reference "ld1"
        STW     outp2,*p2++{st1}        ; name memory reference "st1"

  Then note the specific dependence:

        .mdep   ld1,st1

- You cannot force the assembly optimizer to recognize a dependence between
  instructions, which can be an issue when accessing memory-mapped
  peripherals.

- The C compiler offers the user several methods for influencing the
  handling of memory aliases.

- Do not confuse memory alias disambiguation with memory bank conflict
  detection. Solve the problems separately.



